This page contains the auto-generated API reference documentation for the sequifier package.
Preprocessing Config¶
- class sequifier.config.preprocess_config.PreprocessorModel(*, project_path: str, data_path: str, read_format: str = 'csv', write_format: str = 'parquet', combine_into_single_file: bool = True, selected_columns: list[str] | None = None, group_proportions: list[float], seq_length: int, seq_step_sizes: list[int] | None = None, max_rows: int | None = None, seed: int, n_cores: int | None = None, batches_per_file: int = 1024, process_by_file: bool = True)[source]¶
Pydantic model for preprocessor configuration.
- project_path¶
The path to the sequifier project directory.
- Type:
str
- data_path¶
The path to the input data file.
- Type:
str
- read_format¶
The file type of the input data. Can be ‘csv’ or ‘parquet’.
- Type:
str
- write_format¶
The file type for the preprocessed output data.
- Type:
str
- combine_into_single_file¶
If True, combines all preprocessed data into a single file.
- Type:
bool
- selected_columns¶
A list of columns to be included in the preprocessing. If None, all columns are used.
- Type:
list[str] | None
- group_proportions¶
A list of floats that define the relative sizes of data splits (e.g., for train, validation, test). The sum of proportions must be 1.0.
- Type:
list[float]
- seq_length¶
The sequence length for the model inputs.
- Type:
int
- seq_step_sizes¶
A list of step sizes for creating subsequences within each data split.
- Type:
list[int] | None
- max_rows¶
The maximum number of input rows to process. If None, all rows are processed.
- Type:
int | None
- seed¶
A random seed for reproducibility.
- Type:
int
- n_cores¶
The number of CPU cores to use for parallel processing. If None, all available CPU cores are used.
- Type:
int | None
- batches_per_file¶
The number of batches to process per file.
- Type:
int
- process_by_file¶
A flag to indicate if processing should be done file by file.
- Type:
bool
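A minimal construction of this config in Python (a sketch: the paths and values below are hypothetical placeholders, and only the fields without defaults are set):

    from sequifier.config.preprocess_config import PreprocessorModel

    preprocess_config = PreprocessorModel(
        project_path=".",                    # hypothetical project directory
        data_path="data/events.csv",         # hypothetical input file
        read_format="csv",
        group_proportions=[0.8, 0.1, 0.1],   # train/validation/test; must sum to 1.0
        seq_length=16,
        seed=42,
    )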
Training Config¶
- class sequifier.config.train_config.ModelSpecModel(*, d_model: int, d_model_by_column: dict[str, int] | None = None, nhead: int, d_hid: int, nlayers: int)[source]¶
Pydantic model for model specifications.
- d_model¶
The number of expected features in the input.
- Type:
int
- d_model_by_column¶
The embedding dimensions for each input column. Must sum to d_model.
- Type:
dict[str, int] | None
- nhead¶
The number of attention heads in each multi-head attention layer.
- Type:
int
- d_hid¶
The dimension of the feedforward network model.
- Type:
int
- nlayers¶
The number of layers in the transformer model.
- Type:
int
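A minimal sketch of constructing this spec (the column names are hypothetical; note that d_model_by_column, when given, must sum to d_model):

    from sequifier.config.train_config import ModelSpecModel

    model_spec = ModelSpecModel(
        d_model=128,
        d_model_by_column={"itemId": 96, "price": 32},  # hypothetical columns; 96 + 32 == 128
        nhead=8,    # d_model must be divisible by nhead (standard transformer constraint)
        d_hid=256,
        nlayers=4,
    )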
- class sequifier.config.train_config.TrainModel(*, project_path: str, ddconfig_path: str, model_name: str, training_data_path: str, validation_data_path: str, read_format: str = 'parquet', selected_columns: list[str], column_types: dict[str, str], categorical_columns: list[str], real_columns: list[str], target_columns: list[str], target_column_types: dict[str, str], id_maps: dict[str, dict[str | int, int]], seq_length: int, n_classes: dict[str, int], inference_batch_size: int, seed: int, export_generative_model: bool, export_embedding_model: bool, export_onnx: bool = True, export_pt: bool = False, export_with_dropout: bool = False, model_spec: ModelSpecModel, training_spec: TrainingSpecModel)[source]¶
Pydantic model for training configuration.
- project_path¶
The path to the sequifier project directory.
- Type:
str
- ddconfig_path¶
The path to the data-driven configuration file.
- Type:
str
- model_name¶
The name of the model being trained.
- Type:
str
- training_data_path¶
The path to the training data.
- Type:
str
- validation_data_path¶
The path to the validation data.
- Type:
str
- read_format¶
The file format of the input data (e.g., ‘csv’, ‘parquet’).
- Type:
str
- selected_columns¶
The list of input columns to be used for training.
- Type:
list[str]
- column_types¶
A dictionary mapping each column to its numeric type (‘int64’ or ‘float64’).
- Type:
dict[str, str]
- categorical_columns¶
A list of columns that are categorical.
- Type:
list[str]
- real_columns¶
A list of columns that are real-valued.
- Type:
list[str]
- target_columns¶
The list of target columns for model training.
- Type:
list[str]
- target_column_types¶
A dictionary mapping target columns to their types (‘categorical’ or ‘real’).
- Type:
dict[str, str]
- id_maps¶
For each categorical column, a map from distinct values to their indexed representation.
- Type:
dict[str, dict[str | int, int]]
- seq_length¶
The sequence length of the model’s input.
- Type:
int
- n_classes¶
The number of classes for each categorical column.
- Type:
dict[str, int]
- inference_batch_size¶
The batch size to be used for inference after model export.
- Type:
int
- seed¶
The random seed for numpy and PyTorch.
- Type:
int
- export_generative_model¶
If True, exports the generative model.
- Type:
bool
- export_embedding_model¶
If True, exports the embedding model.
- Type:
bool
- export_onnx¶
If True, exports the model in ONNX format.
- Type:
bool
- export_pt¶
If True, exports the model using torch.save.
- Type:
bool
- export_with_dropout¶
If True, exports the model with dropout enabled.
- Type:
bool
- model_spec¶
The specification of the transformer model architecture.
- training_spec¶
The specification of the training run configuration.
- class sequifier.config.train_config.TrainingSpecModel(*, device: str, device_max_concat_length: int = 12, epochs: int, log_interval: int = 10, class_share_log_columns: list[str] = None, early_stopping_epochs: int | None = None, iter_save: int, batch_size: int, lr: float, criterion: dict[str, str], class_weights: dict[str, list[float]] | None = None, accumulation_steps: int | None = None, dropout: float = 0.0, loss_weights: dict[str, float] | None = None, optimizer: DotDict = None, scheduler: DotDict = None, continue_training: bool = True, distributed: bool = False, load_full_data_to_ram: bool = True, world_size: int = 1, num_workers: int = 0, backend: str = 'nccl')[source]¶
Pydantic model for training specifications.
- device¶
The torch.device to train the model on (e.g., ‘cuda’, ‘cpu’, ‘mps’).
- Type:
str
- device_max_concat_length¶
Maximum sequence length for concatenation on device.
- Type:
int
- epochs¶
The total number of epochs to train for.
- Type:
int
- log_interval¶
The interval in batches for logging.
- Type:
int
- class_share_log_columns¶
A list of column names for which to log the class share of predictions.
- Type:
list[str]
- early_stopping_epochs¶
Number of epochs to wait for validation loss improvement before stopping.
- Type:
int | None
- iter_save¶
The interval in epochs for checkpointing the model.
- Type:
int
- batch_size¶
The training batch size.
- Type:
int
- lr¶
The learning rate.
- Type:
float
- criterion¶
A dictionary mapping each target column to a loss function.
- Type:
dict[str, str]
- class_weights¶
A dictionary mapping categorical target columns to a list of class weights.
- Type:
dict[str, list[float]] | None
- accumulation_steps¶
The number of gradient accumulation steps.
- Type:
int | None
- dropout¶
The dropout value for the transformer model.
- Type:
float
- loss_weights¶
A dictionary mapping columns to specific loss weights.
- Type:
dict[str, float] | None
- optimizer¶
The optimizer configuration.
- Type:
sequifier.config.train_config.DotDict
- scheduler¶
The learning rate scheduler configuration.
- Type:
sequifier.config.train_config.DotDict
- continue_training¶
If True, continue training from the latest checkpoint.
- Type:
bool
- distributed¶
If True, enables distributed training.
- Type:
bool
- load_full_data_to_ram¶
If True, loads the entire dataset into RAM.
- Type:
bool
- world_size¶
The number of processes for distributed training.
- Type:
int
- num_workers¶
The number of worker processes for data loading.
- Type:
int
- backend¶
The distributed training backend (e.g., ‘nccl’).
- Type:
str
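A minimal sketch of a training spec, setting only the fields without defaults (the target column name is hypothetical, and the criterion value is assumed to name a PyTorch loss class):

    from sequifier.config.train_config import TrainingSpecModel

    training_spec = TrainingSpecModel(
        device="cpu",
        epochs=10,
        iter_save=5,                               # checkpoint every 5 epochs
        batch_size=64,
        lr=1e-3,
        criterion={"itemId": "CrossEntropyLoss"},  # hypothetical target column / loss name
    )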
Inference Config¶
- class sequifier.config.infer_config.InfererModel(*, project_path: str, ddconfig_path: str, model_path: str | list[str], model_type: str, data_path: str, training_config_path: str = 'configs/train.yaml', read_format: str = 'parquet', write_format: str = 'csv', selected_columns: list[str], categorical_columns: list[str], real_columns: list[str], target_columns: list[str], column_types: dict[str, str], target_column_types: dict[str, str], output_probabilities: bool = False, map_to_id: bool = True, seed: int, device: str, seq_length: int, inference_batch_size: int, distributed: bool = False, load_full_data_to_ram: bool = True, world_size: int = 1, num_workers: int = 0, sample_from_distribution_columns: list[str] | None = None, infer_with_dropout: bool = False, autoregression: bool = False, autoregression_extra_steps: int | None = None)[source]¶
Pydantic model for inference configuration.
- project_path¶
The path to the sequifier project directory.
- Type:
str
- ddconfig_path¶
The path to the data-driven configuration file.
- Type:
str
- model_path¶
The path to the trained model file(s).
- Type:
str | list[str]
- model_type¶
The type of model, either ‘embedding’ or ‘generative’.
- Type:
str
- data_path¶
The path to the data to be used for inference.
- Type:
str
- training_config_path¶
The path to the training configuration file.
- Type:
str
- read_format¶
The file format of the input data (e.g., ‘csv’, ‘parquet’).
- Type:
str
- write_format¶
The file format for the inference output.
- Type:
str
- selected_columns¶
The list of input columns used for inference.
- Type:
list[str]
- categorical_columns¶
A list of columns that are categorical.
- Type:
list[str]
- real_columns¶
A list of columns that are real-valued.
- Type:
list[str]
- target_columns¶
The list of target columns for inference.
- Type:
list[str]
- column_types¶
A dictionary mapping each column to its numeric type (‘int64’ or ‘float64’).
- Type:
dict[str, str]
- target_column_types¶
A dictionary mapping target columns to their types (‘categorical’ or ‘real’).
- Type:
dict[str, str]
- output_probabilities¶
If True, outputs the probability distributions for categorical target columns.
- Type:
bool
- map_to_id¶
If True, maps categorical output values back to their original IDs.
- Type:
bool
- seed¶
The random seed for reproducibility.
- Type:
int
- device¶
The device to run inference on (e.g., ‘cuda’, ‘cpu’, ‘mps’).
- Type:
str
- seq_length¶
The sequence length of the model’s input.
- Type:
int
- inference_batch_size¶
The batch size for inference.
- Type:
int
- distributed¶
If True, enables distributed inference.
- Type:
bool
- load_full_data_to_ram¶
If True, loads the entire dataset into RAM.
- Type:
bool
- world_size¶
The number of processes for distributed inference.
- Type:
int
- num_workers¶
The number of worker processes for data loading.
- Type:
int
- sample_from_distribution_columns¶
A list of categorical target columns whose predictions are sampled from the predicted distribution instead of taken as the argmax.
- Type:
list[str] | None
- infer_with_dropout¶
If True, applies dropout during inference.
- Type:
bool
- autoregression¶
If True, performs autoregressive inference.
- Type:
bool
- autoregression_extra_steps¶
The number of additional steps for autoregressive inference.
- Type:
int | None
Hyperparameter Search Config¶
- class sequifier.config.hyperparameter_search_config.HyperparameterSearch(*, project_path: str, ddconfig_path: str, hp_search_name: str, search_strategy: str = 'sample', n_samples: int | None = None, model_config_write_path: str, training_data_path: str, validation_data_path: str, read_format: str = 'parquet', selected_columns: list[list[str]], column_types: list[dict[str, str]], categorical_columns: list[list[str]], real_columns: list[list[str]], target_columns: list[str], target_column_types: dict[str, str], id_maps: dict[str, dict[str | int, int]], seq_length: list[int], n_classes: dict[str, int], inference_batch_size: int, export_onnx: bool = True, export_pt: bool = False, export_with_dropout: bool = False, model_hyperparameter_sampling: ModelSpecHyperparameterSampling, training_hyperparameter_sampling: TrainingSpecHyperparameterSampling)[source]¶
Pydantic model for hyperparameter search configuration.
- project_path¶
The path to the sequifier project directory.
- Type:
str
- ddconfig_path¶
The path to the data-driven configuration file.
- Type:
str
- hp_search_name¶
The name for the hyperparameter search.
- Type:
str
- search_strategy¶
The search strategy, either “sample” or “grid”.
- Type:
str
- n_samples¶
The number of samples to draw for the search.
- Type:
int | None
- model_config_write_path¶
The path to write the model configurations to.
- Type:
str
- training_data_path¶
The path to the training data.
- Type:
str
- validation_data_path¶
The path to the validation data.
- Type:
str
- read_format¶
The file format of the input data.
- Type:
str
- selected_columns¶
A list of lists of columns to be used for training.
- Type:
list[list[str]]
- column_types¶
A list of dictionaries mapping columns to their types.
- Type:
list[dict[str, str]]
- categorical_columns¶
A list of lists of categorical columns.
- Type:
list[list[str]]
- real_columns¶
A list of lists of real-valued columns.
- Type:
list[list[str]]
- target_columns¶
The list of target columns for model training.
- Type:
list[str]
- target_column_types¶
A dictionary mapping target columns to their types.
- Type:
dict[str, str]
- id_maps¶
A dictionary mapping categorical values to their indexed representation.
- Type:
dict[str, dict[str | int, int]]
- seq_length¶
A list of possible sequence lengths.
- Type:
list[int]
- n_classes¶
The number of classes for each categorical column.
- Type:
dict[str, int]
- inference_batch_size¶
The batch size for inference.
- Type:
int
- export_onnx¶
If True, exports the model in ONNX format.
- Type:
bool
- export_pt¶
If True, exports the model using torch.save.
- Type:
bool
- export_with_dropout¶
If True, exports the model with dropout enabled.
- Type:
bool
- model_hyperparameter_sampling¶
The sampling configuration for model hyperparameters.
- training_hyperparameter_sampling¶
The sampling configuration for training hyperparameters.
- grid_sample(i)[source]¶
Select a full training configuration based on a grid search index.
This method generates a grid of all possible configurations and selects the configuration at the given index.
- Parameters:
i – The index of the configuration to select from the grid.
- Returns:
A TrainModel instance populated with the selected configuration.
- n_combinations()[source]¶
Calculate the total number of possible configurations.
This method computes the total number of unique configurations that can be generated by a grid search over all defined hyperparameters.
- Returns:
The total number of possible hyperparameter configurations.
- random_sample(i)[source]¶
Randomly sample a full training configuration.
This method generates a complete training configuration by randomly sampling model and training hyperparameters, as well as selecting a column set and sequence length.
- Parameters:
i – The index of the sample, used to create a unique model name.
- Returns:
A TrainModel instance populated with a randomly sampled configuration.
- sample(i)[source]¶
Sample a configuration based on the specified search strategy.
This method delegates to either random_sample or grid_sample based on the search_strategy attribute.
- Parameters:
i – The index of the sample or grid combination to generate.
- Returns:
A TrainModel instance with a generated configuration.
- Raises:
Exception – If the search_strategy is not ‘sample’ or ‘grid’.
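A sketch of how these methods compose, assuming `search` is a fully populated HyperparameterSearch instance:

    # Enumerate configurations according to the configured search strategy.
    if search.search_strategy == "grid":
        configs = [search.sample(i) for i in range(search.n_combinations())]
    else:  # "sample"; n_samples must be set for this strategy
        configs = [search.sample(i) for i in range(search.n_samples)]

    # Each entry is a TrainModel instance ready to be passed to training.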
- class sequifier.config.hyperparameter_search_config.ModelSpecHyperparameterSampling(*, d_model: list[int], d_model_by_column: list[dict[str, int]] | None = None, nhead: list[int], d_hid: list[int], nlayers: list[int])[source]¶
Pydantic model for model specification hyperparameter sampling.
- d_model¶
A list of possible numbers of expected features in the input.
- Type:
list[int]
- d_model_by_column¶
A list of possible embedding dimensions for each input column.
- Type:
list[dict[str, int]] | None
- nhead¶
A list of possible numbers of attention heads in the multi-head attention layers.
- Type:
list[int]
- d_hid¶
A list of possible dimensions of the feedforward network model.
- Type:
list[int]
- nlayers¶
A list of possible numbers of layers in the transformer model.
- Type:
list[int]
- grid_sample(i)[source]¶
Select a set of model hyperparameters based on a grid search index.
This method generates a grid of all possible model hyperparameter combinations and selects the combination at the given index.
- Parameters:
i – The index of the hyperparameter combination to select from the grid.
- Returns:
A ModelSpecModel instance populated with the selected set of hyperparameters.
- n_combinations()[source]¶
Calculate the total number of model hyperparameter combinations.
This method computes the total number of unique model hyperparameter sets that can be generated by the grid search.
- Returns:
The total number of possible model hyperparameter combinations.
- random_sample()[source]¶
Randomly sample a set of model hyperparameters.
This method selects a random combination of model hyperparameters from the defined lists of possibilities. It ensures that d_model, d_model_by_column, and nhead are paired correctly.
- Returns:
A ModelSpecModel instance populated with a randomly sampled set of hyperparameters.
- class sequifier.config.hyperparameter_search_config.TrainingSpecHyperparameterSampling(*, device: str, epochs: list[int], log_interval: int = 10, class_share_log_columns: list[str] = None, early_stopping_epochs: int | None = None, iter_save: int, batch_size: list[int], lr: list[float], criterion: dict[str, str], class_weights: dict[str, list[float]] | None = None, accumulation_steps: list[int], dropout: list[float] = [0.0], loss_weights: dict[str, float] | None = None, optimizer: list[DotDict] = None, scheduler: list[DotDict] = None, continue_training: bool = True)[source]¶
Pydantic model for training specification hyperparameter sampling.
- device¶
The device to train on (e.g., ‘cuda’, ‘cpu’).
- Type:
str
- epochs¶
A list of possible numbers of epochs to train for.
- Type:
list[int]
- log_interval¶
The interval in batches for logging.
- Type:
int
- class_share_log_columns¶
Columns for which to log class share.
- Type:
list[str]
- early_stopping_epochs¶
Number of epochs for early stopping.
- Type:
int | None
- iter_save¶
Interval in epochs for saving model checkpoints.
- Type:
int
- batch_size¶
A list of possible batch sizes.
- Type:
list[int]
- lr¶
A list of possible learning rates.
- Type:
list[float]
- criterion¶
A dictionary mapping target columns to loss functions.
- Type:
dict[str, str]
- class_weights¶
Optional dictionary mapping columns to class weights.
- Type:
dict[str, list[float]] | None
- accumulation_steps¶
A list of possible gradient accumulation steps.
- Type:
list[int]
- dropout¶
A list of possible dropout rates.
- Type:
list[float]
- loss_weights¶
Optional dictionary mapping columns to loss weights.
- Type:
dict[str, float] | None
- optimizer¶
A list of possible optimizer configurations.
- Type:
list[sequifier.config.train_config.DotDict]
- scheduler¶
A list of possible scheduler configurations.
- Type:
list[sequifier.config.train_config.DotDict]
- continue_training¶
Flag to continue training from a checkpoint.
- Type:
bool
- __init__(**kwargs)[source]¶
Initialize the TrainingSpecHyperparameterSampling instance.
This method initializes the Pydantic BaseModel and then processes the optimizer and scheduler configurations from the provided keyword arguments, converting them into DotDict objects.
- Parameters:
**kwargs – Keyword arguments that correspond to the attributes of this class. The ‘optimizer’ and ‘scheduler’ arguments are expected to be lists of dictionaries.
- grid_sample(i)[source]¶
Select a set of training hyperparameters based on a grid search index.
This method generates a grid of all possible hyperparameter combinations and selects the combination at the given index.
- Parameters:
i – The index of the hyperparameter combination to select from the grid.
- Returns:
A TrainingSpecModel instance populated with the selected set of hyperparameters.
- n_combinations()[source]¶
Calculate the total number of hyperparameter combinations.
This method computes the total number of unique hyperparameter sets that can be generated by the grid search.
- Returns:
The total number of possible hyperparameter combinations.
- random_sample()[source]¶
Randomly sample a set of training hyperparameters.
This method selects a random combination of hyperparameters from the defined lists of possibilities. It ensures that learning rates and schedulers are paired correctly.
- Returns:
A TrainingSpecModel instance populated with a randomly sampled set of hyperparameters.
Non-standard Optimizers¶
- class sequifier.optimizers.ademamix.AdEMAMix(params={}, lr=0.001, betas=(0.9, 0.999, 0.9999), eps=1e-08, weight_decay=0, alpha=5.0, T_alpha_beta3=None)[source]¶
Implements the AdEMAMix optimizer.
This optimizer is based on the paper “The AdEMAMix Optimizer: Better, Faster, Older”. It augments Adam with a second, slower exponential moving average of past gradients and mixes the two estimates, allowing the optimizer to benefit from much older gradients.
- Parameters:
params (iterable) – Iterable of parameters to optimize or dicts defining parameter groups.
lr (float, optional) – Learning rate (default: 1e-3).
betas (Tuple[float, float, float], optional) – Coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999, 0.9999)).
eps (float, optional) – Term added to the denominator to improve numerical stability (default: 1e-8).
weight_decay (float, optional) – Weight decay (L2 penalty) (default: 0).
alpha (float, optional) – Mixing coefficient (default: 5.0).
T_alpha_beta3 (int, optional) – Time period for alpha and beta3 scheduling (default: None).
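Usage follows the standard torch.optim pattern; a minimal sketch:

    import torch
    from sequifier.optimizers.ademamix import AdEMAMix

    model = torch.nn.Linear(10, 2)
    optimizer = AdEMAMix(model.parameters(), lr=1e-3, betas=(0.9, 0.999, 0.9999), alpha=5.0)

    loss = model(torch.randn(4, 10)).sum()
    loss.backward()
    optimizer.step()       # apply the mixed-EMA update
    optimizer.zero_grad()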
Internals¶
- sequifier.sequifier.build_args_config(args: Any) dict[str, Any][source]¶
Build configuration dictionary from command-line arguments.
- Parameters:
args – Parsed command-line arguments.
- Returns:
Dictionary containing configuration options.
- sequifier.sequifier.setup_parser() ArgumentParser[source]¶
Set up the argument parser for the command-line interface.
- Returns:
Configured ArgumentParser object.
- class sequifier.preprocess.Preprocessor(project_path: str, data_path: str, read_format: str, write_format: str, combine_into_single_file: bool, selected_columns: list[str] | None, group_proportions: list[float], seq_length: int, seq_step_sizes: list[int], max_rows: int | None, seed: int, n_cores: int | None, batches_per_file: int, process_by_file: bool)[source]¶
A class for preprocessing data for the sequifier model.
This class handles loading, preprocessing, and saving data. It supports single-file and multi-file processing, and can handle large datasets by processing them in batches.
- project_path¶
The path to the sequifier project directory.
- Type:
str
- batches_per_file¶
The number of batches to process per file.
- Type:
int
- data_name_root¶
The root name of the data file.
- Type:
str
- combine_into_single_file¶
Whether to combine the output into a single file.
- Type:
bool
- target_dir¶
The target directory for temporary files.
- Type:
str
- seed¶
The random seed for reproducibility.
- Type:
int
- n_cores¶
The number of cores to use for parallel processing.
- Type:
int
- split_paths¶
The paths to the output split files.
- Type:
list[str]
- __init__(project_path: str, data_path: str, read_format: str, write_format: str, combine_into_single_file: bool, selected_columns: list[str] | None, group_proportions: list[float], seq_length: int, seq_step_sizes: list[int], max_rows: int | None, seed: int, n_cores: int | None, batches_per_file: int, process_by_file: bool)[source]¶
Initializes the Preprocessor with the given parameters.
- Parameters:
project_path – The path to the sequifier project directory.
data_path – The path to the input data file.
read_format – The file type of the input data.
write_format – The file type for the preprocessed output data.
combine_into_single_file – Whether to combine the output into a single file.
selected_columns – A list of columns to be included in the preprocessing.
group_proportions – A list of floats that define the relative sizes of data splits.
seq_length – The sequence length for the model inputs.
seq_step_sizes – A list of step sizes for creating subsequences.
max_rows – The maximum number of input rows to process.
seed – A random seed for reproducibility.
n_cores – The number of CPU cores to use for parallel processing.
batches_per_file – The number of batches to process per file.
process_by_file – A flag to indicate if processing should be done file by file.
- sequifier.preprocess.cast_columns_to_string(data: DataFrame) DataFrame[source]¶
Casts the column names of a Polars DataFrame to strings.
This is often necessary because schemas may carry integer column names (e.g., 0, 1, 2, …), which must be cast to strings for some Polars operations.
- Parameters:
data – The Polars DataFrame.
- Returns:
The same DataFrame with its columns attribute modified.
- sequifier.preprocess.combine_maps(map1: dict[str | int, int], map2: dict[str | int, int]) dict[str | int, int][source]¶
Combines two ID maps into a new, consolidated map.
Takes all unique keys from both map1 and map2, sorts them, and creates a new, single map where keys are mapped to 1-based indices based on the sorted order. This ensures a consistent mapping across different data chunks.
- Parameters:
map1 – The first ID map.
map2 – The second ID map.
- Returns:
A new, combined, and re-indexed ID map.
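For example (expected behavior per the description above; the exact output shown is an assumption):

    from sequifier.preprocess import combine_maps

    map1 = {"a": 1, "c": 2}
    map2 = {"b": 1, "c": 2}

    # Keys are unioned, sorted, and re-indexed starting from 1:
    combine_maps(map1, map2)  # -> {"a": 1, "b": 2, "c": 3}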
- sequifier.preprocess.combine_multiprocessing_outputs(project_path: str, target_dir: str, n_splits: int, input_files: dict[int, list[str]], dataset_name: str, write_format: str, in_target_dir: bool = False, pre_split_str: str | None = None, post_split_str: str | None = None) None[source]¶
Combines multiple intermediate batch files into final split files.
This function iterates through each split and combines all the intermediate files listed in input_files[split] into a single final output file for that split.
For “csv” format, it uses the csvstack command-line utility.
For “parquet” format, it uses pyarrow.parquet.ParquetWriter to concatenate the files efficiently.
- Parameters:
project_path – The path to the sequifier project directory.
target_dir – The temporary directory containing intermediate files.
n_splits – The number of data splits.
input_files – A dictionary mapping split index (int) to a list of input file paths (str) for that split.
dataset_name – The root name for the final output files.
write_format – The file format (“csv” or “parquet”).
in_target_dir – If True, the final combined file is written inside target_dir. If False, it’s written to data/.
pre_split_str – An optional string to insert into the filename before the “-split{i}” part.
post_split_str – An optional string to insert into the filename after the “-split{i}” part.
- sequifier.preprocess.combine_parquet_files(files: list[str], out_path: str) None[source]¶
Combines multiple Parquet files into a single Parquet file.
This function reads the schema from the first file and uses it to initialize a ParquetWriter. It then iterates through all files in the list, reading each one as a table and writing it to the new combined file. This is more memory-efficient than reading all files into one large table first.
- Parameters:
files – A list of paths to the Parquet files to combine.
out_path – The path for the combined output Parquet file.
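A minimal sketch of the streaming pattern described above, using pyarrow's public API (not necessarily the exact implementation):

    import pyarrow.parquet as pq

    def combine_parquet_files_sketch(files: list[str], out_path: str) -> None:
        # Take the schema from the first file, then stream each file
        # through the writer instead of concatenating tables in memory.
        schema = pq.read_schema(files[0])
        with pq.ParquetWriter(out_path, schema) as writer:
            for file in files:
                writer.write_table(pq.read_table(file))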
- sequifier.preprocess.create_file_paths_for_multiple_files1(project_path: str, target_dir: str, n_splits: int, n_batches: int, process_id: int, file_index: int, dataset_name: str, write_format: str) dict[int, list[str]][source]¶
Creates a dictionary of temporary file paths for a specific data file.
This is used in the multi-file, combine_into_single_file=True workflow. It generates file path names for intermediate batches before they are combined.
The naming pattern is: {dataset_name}-{process_id}-{file_index}-split{split}-{batch_id}.{write_format}
- Parameters:
project_path – The path to the sequifier project directory.
target_dir – The temporary directory to place files in.
n_splits – The number of data splits.
n_batches – The number of batches created by the process.
process_id – The ID of the multiprocessing worker.
file_index – The index of the file being processed by this worker.
dataset_name – The root name of the dataset.
write_format – The file extension.
- Returns:
A dictionary mapping a split index (int) to a list of file paths (str) for all batches in that split.
- sequifier.preprocess.create_file_paths_for_multiple_files2(project_path: str, target_dir: str, n_splits: int, n_processes: int, n_files: dict[int, int], dataset_name: str, write_format: str) dict[int, list[str]][source]¶
Creates a dictionary of intermediate file paths for a multi-file run.
This is used in the multi-file, combine_into_single_file=True workflow. It generates the file paths for the combined files from each process, which are the inputs to the final combination step.
The naming pattern is: {dataset_name}-{process_id}-{file_index}-split{split}.{write_format}
- Parameters:
project_path – The path to the sequifier project directory.
target_dir – The temporary directory where files are located.
n_splits – The number of data splits.
n_processes – The total number of multiprocessing workers.
n_files – A dictionary mapping process_id to the number of files that process handled.
dataset_name – The root name of the dataset.
write_format – The file extension.
- Returns:
A dictionary mapping a split index (int) to a list of all intermediate combined file paths (str) for that split.
- sequifier.preprocess.create_file_paths_for_single_file(project_path: str, target_dir: str, n_splits: int, n_batches: int, dataset_name: str, write_format: str) dict[int, list[str]][source]¶
Creates a dictionary of temporary file paths for a single-file run.
This is used in the single-file, combine_into_single_file=True workflow. It generates file path names for intermediate batches created by different processes before they are combined.
The naming pattern is: {dataset_name}-split{split}-{core_id}.{write_format}
- Parameters:
project_path – The path to the sequifier project directory.
target_dir – The temporary directory to place files in.
n_splits – The number of data splits.
n_batches – The number of processes (batches) running in parallel.
dataset_name – The root name of the dataset.
write_format – The file extension.
- Returns:
A dictionary mapping a split index (int) to a list of file paths (str) for all batches in that split.
- sequifier.preprocess.create_id_map(data: DataFrame, column: str) dict[str | int, int][source]¶
Creates a map from unique values in a column to integer indices.
Finds all unique values in the specified column of the data DataFrame, sorts them, and creates a dictionary mapping each unique value to a 1-based integer index.
- Parameters:
data – The Polars DataFrame containing the column.
column – The name of the column to map.
- Returns:
A dictionary mapping unique values (str or int) to an integer index (starting from 1).
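For example (expected behavior per the description; the exact output shown is an assumption):

    import polars as pl
    from sequifier.preprocess import create_id_map

    df = pl.DataFrame({"itemId": ["b", "a", "b", "c"]})
    create_id_map(df, "itemId")  # -> {"a": 1, "b": 2, "c": 3}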
- sequifier.preprocess.delete_files(files: list[str] | dict[int, list[str]]) None[source]¶
Deletes a list of files from the filesystem.
- Parameters:
files – A list of file paths to delete, or a dictionary whose values are lists of file paths to delete.
- sequifier.preprocess.extract_sequences(data: DataFrame, schema: Any, seq_length: int, seq_step_size: int, columns: list[str]) DataFrame[source]¶
Extracts subsequences from a DataFrame of full sequences.
This function takes a DataFrame where each row contains all items for a single sequenceId. It iterates through each sequenceId, extracts all possible subsequences of seq_length using the specified seq_step_size, calculates the starting position of each subsequence within the original sequence, and formats them into a new, long-format DataFrame that conforms to the provided schema.
- Parameters:
data – The input Polars DataFrame, grouped by “sequenceId”.
schema – The schema for the output long-format DataFrame.
seq_length – The length of the subsequences to extract.
seq_step_size – The step size to use when sliding the window to create subsequences.
columns – A list of the data column names (features) to extract.
- Returns:
A new, long-format Polars DataFrame containing the extracted subsequences, matching the provided schema. Includes columns for sequenceId, subsequenceId, startItemPosition, inputCol, and the sequence items (‘0’, ‘1’, …).
- sequifier.preprocess.extract_subsequences(in_seq: dict[str, list], seq_length: int, seq_step_size: int, columns: list[str]) dict[str, list[list[float | int]]][source]¶
Extracts subsequences from a dictionary of sequence lists.
This function takes a dictionary in_seq where keys are column names and values are lists of items for a single full sequence. It first pads the sequences with 0s at the beginning if they are shorter than seq_length. Then, it calculates the subsequence start indices using get_subsequence_starts and extracts all subsequences.
- Parameters:
in_seq – A dictionary mapping column names to lists of items (e.g., {‘col_A’: [1, 2, 3, 4, 5], ‘col_B’: [6, 7, 8, 9, 10]}).
seq_length – The length of the subsequences to extract.
seq_step_size – The desired step size between subsequences.
columns – A list of the column names (keys in in_seq) to process.
- Returns:
A dictionary mapping column names to a list of lists, where each inner list is a subsequence.
- sequifier.preprocess.get_batch_limits(data: DataFrame, n_batches: int) list[tuple[int, int]][source]¶
Calculates row indices to split a DataFrame into batches.
This function divides the DataFrame into n_batches roughly equal chunks. Crucially, it ensures that no sequenceId is split across two different batches. It does this by finding the ideal split points and then adjusting them to the nearest sequenceId boundary.
- Parameters:
data – The DataFrame to split. Must be sorted by “sequenceId”.
n_batches – The desired number of batches.
- Returns:
A list of (start_index, end_index) tuples, where each tuple defines the row indices for a batch.
- sequifier.preprocess.get_combined_statistics(n1: int, mean1: float, std1: float, n2: int, mean2: float, std2: float) tuple[float, float][source]¶
Calculates the combined mean and standard deviation of two data subsets.
Uses a stable parallel algorithm (related to Welford’s algorithm) to combine statistics from two subsets without needing the original data.
- Parameters:
n1 – Number of samples in subset 1.
mean1 – Mean of subset 1.
std1 – Standard deviation of subset 1.
n2 – Number of samples in subset 2.
mean2 – Mean of subset 2.
std2 – Standard deviation of subset 2.
- Returns:
A tuple (combined_mean, combined_std) containing the combined mean and standard deviation of the two subsets.
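The standard parallel-combination formulas matching this description (a sketch assuming population statistics, i.e. ddof=0; the library's exact conventions may differ):

    import math

    def combine_statistics_sketch(n1, mean1, std1, n2, mean2, std2):
        # Weighted mean, plus Chan et al.'s variance combination with a
        # correction term for the shift between the two subset means.
        n = n1 + n2
        mean = (n1 * mean1 + n2 * mean2) / n
        var = (n1 * std1**2 + n2 * std2**2 + n1 * n2 * (mean1 - mean2) ** 2 / n) / n
        return mean, math.sqrt(var)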
- sequifier.preprocess.get_group_bounds(data_subset: DataFrame, group_proportions: list[float])[source]¶
Calculates row indices for splitting a sequence into groups.
This function takes a DataFrame data_subset (which typically contains all items for a single sequenceId) and calculates the row indices to split it into multiple groups (e.g., train, val, test) based on the provided group_proportions.
- Parameters:
data_subset – The DataFrame (for a single sequence) to split.
group_proportions – A list of floats (e.g., [0.8, 0.1, 0.1]) that sum to 1.0, defining the relative sizes of the splits.
- Returns:
A list of (start_index, end_index) tuples, one for each proportion, defining the row slices for each group.
- sequifier.preprocess.get_subsequence_starts(in_seq_length: int, seq_length: int, seq_step_size: int) ndarray[source]¶
Calculates the start indices for extracting subsequences.
This function determines the starting indices for sliding a window of seq_length over an input sequence of in_seq_length. It aims to use seq_step_size, but adjusts the step size slightly to ensure that the windows are distributed as evenly as possible and cover the full sequence from the beginning to the end.
- Parameters:
in_seq_length – The length of the original input sequence.
seq_length – The length of the subsequences to extract.
seq_step_size – The desired step size between subsequences.
- Returns:
A numpy array of integer start indices for each subsequence.
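One way to realize the described behavior (a sketch, not the exact implementation):

    import math
    import numpy as np

    def get_subsequence_starts_sketch(in_seq_length: int, seq_length: int,
                                      seq_step_size: int) -> np.ndarray:
        # Space the window starts as evenly as possible between 0 and the
        # last valid start, so the windows cover the whole sequence.
        last_start = in_seq_length - seq_length
        n_windows = max(1, math.ceil(last_start / seq_step_size) + 1)
        return np.round(np.linspace(0, last_start, n_windows)).astype(int)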
- sequifier.preprocess.insert_top_folder(path: str, folder_name: str) str[source]¶
Inserts a directory name into a file path, just before the filename.
Example
insert_top_folder(“a/b/c.txt”, “temp”) returns “a/b/temp/c.txt”
- Parameters:
path – The original file path.
folder_name – The name of the folder to insert.
- Returns:
The new path string with the folder inserted.
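A minimal sketch of this path manipulation:

    import os

    def insert_top_folder_sketch(path: str, folder_name: str) -> str:
        head, tail = os.path.split(path)              # "a/b", "c.txt"
        return os.path.join(head, folder_name, tail)  # "a/b/temp/c.txt"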
- sequifier.preprocess.preprocess(args: Any, args_config: dict[str, Any]) None[source]¶
Runs the main data preprocessing pipeline.
This function loads the preprocessing configuration, initializes the Preprocessor class, and executes the preprocessing steps based on the loaded configuration.
- Parameters:
args – An object containing command-line arguments. Expected to have a config_path attribute specifying the path to the YAML configuration file.
args_config – A dictionary containing additional configuration parameters that may override or supplement the settings loaded from the config file.
- sequifier.preprocess.preprocess_batch(project_path: str, data_name_root: str, process_id: int, batch: DataFrame, schema: Any, split_paths: list[str], seq_length: int, seq_step_sizes: list[int], data_columns: list[str], col_types: dict[str, str], group_proportions: list[float], target_dir: str, write_format: str, batches_per_file: int) None[source]¶
Processes a batch of data.
- Parameters:
project_path – The path to the sequifier project directory.
data_name_root – The root name of the data file.
process_id – The id of the process.
batch – The batch of data to process.
schema – The schema for the preprocessed data.
split_paths – The paths to the output split files.
seq_length – The sequence length for the model inputs.
seq_step_sizes – A list of step sizes for creating subsequences.
data_columns – A list of data columns.
col_types – A dictionary containing the column types.
group_proportions – A list of floats that define the relative sizes of data splits.
target_dir – The target directory for temporary files.
write_format – The file format for the output files.
batches_per_file – The number of batches to process per file.
- sequifier.preprocess.process_and_write_data_pt(data: DataFrame, seq_length: int, path: str, column_types: dict[str, str])[source]¶
Processes the sequence DataFrame and writes it to a .pt file.
This function takes the long-format sequence DataFrame (data), aggregates it by sequenceId and subsequenceId, and pivots it so that each inputCol becomes its own column containing a list of sequence items. It also extracts the startItemPosition.
It then converts these lists into NumPy arrays, splits them into sequences (all but last item) and targets (all but first item), and converts them to PyTorch tensors along with sequence/subsequence IDs and start positions. The final data tuple (sequences_dict, targets_dict, sequence_ids_tensor, subsequence_ids_tensor, start_item_positions_tensor) is saved to a .pt file using torch.save.
- Parameters:
data – The long-format Polars DataFrame of extracted sequences.
seq_length – The total sequence length (N). The resulting tensors will have sequence length N-1.
path – The output file path (e.g., “data/batch_0.pt”).
column_types – A dictionary mapping column names to their string data types, used to determine the correct torch dtype.
- class sequifier.train.TransformerEmbeddingModel(transformer_model: TransformerModel)[source]¶
A wrapper around the TransformerModel to expose the embedding functionality.
- __init__(transformer_model: TransformerModel)[source]¶
Initializes the TransformerEmbeddingModel.
- Parameters:
transformer_model – The TransformerModel to wrap.
- class sequifier.train.TransformerModel(hparams: Any, rank: int | None = None)[source]¶
The main Transformer model for the sequifier.
This class implements the Transformer model, including the training and evaluation loops, as well as the export functionality.
- __init__(hparams: Any, rank: int | None = None)[source]¶
Initializes the TransformerModel.
Based on the hyperparameters, this initializes:
- Embeddings for categorical and real features (self.encoder)
- Positional encoders (self.pos_encoder)
- The main TransformerEncoder (self.transformer_encoder)
- Output decoders for each target column (self.decoder)
- Loss functions (self.criterion)
- Optimizer (self.optimizer) and scheduler (self.scheduler)
- Parameters:
hparams – The hyperparameters for the model (e.g., from TrainModel config).
rank – The rank of the current process (for distributed training).
- apply_softmax(target_column: str, output: Tensor) Tensor[source]¶
Applies softmax to the output of the decoder.
If the target is real, it returns the output unchanged. If the target is categorical, it applies LogSoftmax.
- Parameters:
target_column – The name of the target column.
output – The decoded output tensor (logits or real value).
- Returns:
The output tensor, with LogSoftmax applied if categorical.
- decode(target_column: str, output: Tensor) Tensor[source]¶
Decodes the output of the transformer encoder.
Applies the appropriate final linear layer for a given target column.
- Parameters:
target_column – The name of the target column to decode.
output – The raw output tensor from the TransformerEncoder (seq_length, batch_size, d_model).
- Returns:
The decoded output (logits or real value) for the target column (seq_length, batch_size, n_classes/1).
- forward(src: dict[str, Tensor]) dict[str, Tensor][source]¶
The main forward pass of the model.
This is typically used for inference/evaluation, returning the probabilities/values for the last token in the sequence.
- Parameters:
src – A dictionary mapping column names to input tensors (batch_size, seq_length).
- Returns:
A dictionary mapping target column names to their final output (LogSoftmax probabilities or real values) for the last token (batch_size, n_classes/1).
- forward_embed(src: dict[str, Tensor]) Tensor[source]¶
Forward pass for the embedding model.
This returns only the embedding from the last token in the sequence.
- Parameters:
src – A dictionary mapping column names to input tensors (batch_size, seq_length).
- Returns:
The embedding tensor for the last token (batch_size, d_model).
- forward_inner(src: dict[str, Tensor]) Tensor[source]¶
The inner forward pass of the model.
This handles embedding lookup, positional encoding, and passing the combined tensor through the transformer encoder.
- Parameters:
src – A dictionary mapping column names to input tensors (batch_size, seq_length).
- Returns:
The raw output tensor from the TransformerEncoder (seq_length, batch_size, d_model).
- forward_train(src: dict[str, Tensor]) dict[str, Tensor][source]¶
Forward pass for training.
This runs the inner forward pass and then applies the appropriate decoder for each target column.
- Parameters:
src – A dictionary mapping column names to input tensors (batch_size, seq_length).
- Returns:
A dictionary mapping target column names to their raw output (logit) tensors (seq_length, batch_size, n_classes/1).
- train_model(train_loader: DataLoader, valid_loader: DataLoader, train_sampler: RandomSampler | DistributedSampler | DistributedGroupedRandomSampler | None, valid_sampler: RandomSampler | DistributedSampler | DistributedGroupedRandomSampler | None) None[source]¶
Trains the model.
This method contains the main training loop, including epoch iteration, validation, early stopping logic, and model saving/exporting.
- Parameters:
train_loader – DataLoader for the training dataset.
valid_loader – DataLoader for the validation dataset.
train_sampler – Sampler for the training DataLoader, used to set the epoch in distributed training.
valid_sampler – Sampler for the validation DataLoader, used to set the epoch in distributed training.
- sequifier.train.format_number(number: int | float | float32) str[source]¶
Format a number for display.
- Parameters:
number – The number to format.
- Returns:
A formatted string representation of the number.
- sequifier.train.infer_with_embedding_model(model: Module, x: list[dict[str, ndarray]], device: str, size: int, target_columns: list[str]) ndarray[source]¶
Performs inference with an embedding model.
- Parameters:
model – The loaded TransformerEmbeddingModel.
x – A list of input data dictionaries (batched).
device – The device to run inference on.
size – The total number of samples (unused in this function).
target_columns – List of target column names (unused in this function).
- Returns:
A NumPy array containing the concatenated embeddings from all batches.
- sequifier.train.infer_with_generative_model(model: Module, x: list[dict[str, ndarray]], device: str, size: int, target_columns: list[str]) dict[str, ndarray][source]¶
Performs inference with a generative model.
- Parameters:
model – The loaded TransformerModel.
x – A list of input data dictionaries (batched).
device – The device to run inference on.
size – The total number of samples to trim the final output to.
target_columns – List of target column names to extract from the output.
- Returns:
A dictionary mapping target column names to their concatenated output NumPy arrays, trimmed to size.
- sequifier.train.load_inference_model(model_type: str, model_path: str, training_config_path: str, args_config: dict[str, Any], device: str, infer_with_dropout: bool) Module[source]¶
Loads a trained model for inference.
- Parameters:
model_type – “generative” or “embedding”.
model_path – Path to the saved .pt model file.
training_config_path – Path to the .yaml config file used for training.
args_config – A dictionary of override configurations.
device – The device to load the model onto (e.g., “cuda”, “cpu”).
infer_with_dropout – Whether to force dropout layers to be active during inference.
- Returns:
The loaded and compiled torch.nn.Module (TransformerModel or TransformerEmbeddingModel) in evaluation mode.
- sequifier.train.setup(rank: int, world_size: int, backend: str = 'nccl')[source]¶
Sets up the distributed training environment.
- Parameters:
rank – The rank of the current process.
world_size – The total number of processes.
backend – The distributed backend to use.
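A sketch of the canonical single-node pattern for such a helper (the master address/port values are assumptions, not necessarily what sequifier uses):

    import os
    import torch.distributed as dist

    def setup_sketch(rank: int, world_size: int, backend: str = "nccl") -> None:
        os.environ.setdefault("MASTER_ADDR", "localhost")  # assumed defaults
        os.environ.setdefault("MASTER_PORT", "12355")
        dist.init_process_group(backend, rank=rank, world_size=world_size)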
- sequifier.train.train(args: Any, args_config: dict[str, Any]) None[source]¶
The main training function.
- Parameters:
args – The command-line arguments.
args_config – The configuration dictionary.
- sequifier.train.train_worker(rank: int, world_size: int, config: TrainModel, from_folder: bool)[source]¶
The worker function for distributed training.
- Parameters:
rank – The rank of the current process.
world_size – The total number of processes.
config – The training configuration.
from_folder – Whether to load data from a folder (e.g., preprocessed .pt files) or a single file (e.g., .parquet).
- class sequifier.infer.Inferer(model_type: str, model_path: str, project_path: str, id_maps: dict[str, dict[str | int, int]] | None, selected_columns_statistics: dict[str, dict[str, float]], map_to_id: bool, categorical_columns: list[str], real_columns: list[str], selected_columns: list[str] | None, target_columns: list[str], target_column_types: dict[str, str], sample_from_distribution_columns: list[str] | None, infer_with_dropout: bool, inference_batch_size: int, device: str, args_config: dict[str, Any], training_config_path: str)[source]¶
A class for performing inference with a trained sequifier model.
This class encapsulates the model (either ONNX session or PyTorch model), normalization statistics, ID mappings, and all configuration needed to run inference. It provides methods to handle batching, model-specific inference calls (PyTorch vs. ONNX), and post-processing (like inverting normalization).
- model_type¶
‘generative’ or ‘embedding’.
- map_to_id¶
Whether to map integer predictions back to original IDs.
- selected_columns_statistics¶
Dict of ‘mean’ and ‘std’ for real columns.
- index_map¶
The inverse of id_maps, for mapping indices back to values.
- device¶
The device (‘cuda’ or ‘cpu’) for inference.
- target_columns¶
List of columns the model predicts.
- target_column_types¶
Dict mapping target columns to ‘categorical’ or ‘real’.
- inference_model_type¶
‘onnx’ or ‘pt’.
- ort_session¶
onnxruntime.InferenceSession if using ONNX.
- inference_model¶
The loaded PyTorch model if using ‘pt’.
- __init__(model_type: str, model_path: str, project_path: str, id_maps: dict[str, dict[str | int, int]] | None, selected_columns_statistics: dict[str, dict[str, float]], map_to_id: bool, categorical_columns: list[str], real_columns: list[str], selected_columns: list[str] | None, target_columns: list[str], target_column_types: dict[str, str], sample_from_distribution_columns: list[str] | None, infer_with_dropout: bool, inference_batch_size: int, device: str, args_config: dict[str, Any], training_config_path: str)[source]¶
Initializes the Inferer.
- Parameters:
model_type – The type of model to use for inference.
model_path – The path to the trained model.
project_path – The path to the sequifier project directory.
id_maps – A dictionary of id maps for categorical columns.
selected_columns_statistics – A dictionary of statistics for numerical columns.
map_to_id – Whether to map the output to the original ids.
categorical_columns – A list of categorical columns.
real_columns – A list of real columns.
selected_columns – A list of selected columns.
target_columns – A list of target columns.
target_column_types – A dictionary of target column types.
sample_from_distribution_columns – A list of columns to sample from the distribution.
infer_with_dropout – Whether to use dropout during inference.
inference_batch_size – The batch size for inference.
device – The device to use for inference.
args_config – A dictionary of configuration overrides derived from the command-line arguments.
training_config_path – The path to the training configuration file.
- adjust_and_infer_embedding(x: dict[str, ndarray], size: int)[source]¶
Handles batching and backend-specific calls for embedding inference.
This function prepares the input data x into batches using prepare_inference_batches and then calls the correct inference backend based on self.inference_model_type (.pt or .onnx).
- Parameters:
x – The complete dictionary of input features (NumPy arrays).
size – The total number of samples in x, used to truncate any padding added for batching.
- Returns:
A NumPy array of embeddings, concatenated from all batches.
- adjust_and_infer_generative(x: dict[str, ndarray], size: int)[source]¶
Handles batching and backend-specific calls for generative inference.
This function prepares the input data x into batches using prepare_inference_batches and then calls the correct inference backend based on self.inference_model_type (.pt or .onnx). It aggregates the results from all batches.
- Parameters:
x – The complete dictionary of input features (NumPy arrays).
size – The total number of samples in x, used to truncate any padding added for batching.
- Returns:
A dictionary mapping target column names to NumPy arrays of raw model outputs (logits or real values).
- expand_to_batch_size(x: ndarray) ndarray[source]¶
Pads a NumPy array to match self.inference_batch_size.
Repeats samples from x until the array’s first dimension is equal to self.inference_batch_size.
- Parameters:
x – The input NumPy array to pad.
- Returns:
A new NumPy array of size self.inference_batch_size in the first dimension.
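A sketch of the padding behavior described above:

    import numpy as np

    def expand_to_batch_size_sketch(x: np.ndarray, batch_size: int) -> np.ndarray:
        # Repeat the samples until the first dimension reaches batch_size,
        # then trim to exactly batch_size.
        n_repeats = -(-batch_size // x.shape[0])  # ceiling division
        return np.concatenate([x] * n_repeats, axis=0)[:batch_size]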
- infer_embedding(x: dict[str, ndarray]) ndarray[source]¶
Performs inference with an embedding model.
This is a high-level wrapper that calls adjust_and_infer_embedding to handle batching and model-specific logic.
- Parameters:
x – A dictionary mapping feature names to NumPy arrays. All arrays must have the same first dimension (batch size).
- Returns:
A 2D NumPy array of the resulting embeddings.
- infer_generative(x: dict[str, ndarray] | None, probs: dict[str, ndarray] | None = None, return_probs: bool = False) dict[str, ndarray][source]¶
Performs generative inference, returning probabilities or predictions.
This function orchestrates the generative inference process:
1. If probs is not provided, it calls adjust_and_infer_generative to get the raw model output (logits or real values) from x.
2. If return_probs is True, it normalizes the logits for categorical columns into probabilities (using softmax, implemented in normalize) and returns a dictionary of probabilities (for categorical targets) and raw predicted values (for real targets).
3. If return_probs is False (the default), it converts the model outputs (from either x or probs) into final predictions: for categorical columns it takes the argmax or samples from the distribution (sample_with_cumsum); for real columns it returns the value as-is.
- Parameters:
x – A dictionary mapping feature names to NumPy arrays. Required if probs is not provided.
probs – An optional dictionary of probabilities/logits. If provided, this skips the model inference step.
return_probs – If True, returns normalized probabilities for categorical targets. If False, returns final class predictions (via argmax or sampling).
- Returns:
A dictionary mapping target column names to NumPy arrays. The content of the arrays depends on return_probs.
- infer_pure(x: dict[str, ndarray]) list[ndarray][source]¶
Performs a single inference pass using the ONNX session.
This function assumes x is already a single, correctly-sized batch. It formats the input dictionary to match the ONNX model’s input names and executes self.ort_session.run().
- Parameters:
x – A dictionary of feature arrays for a single batch. This batch must be of size self.inference_batch_size.
- Returns:
A list of NumPy arrays, representing the raw outputs from the ONNX model.
- invert_normalization(values: ndarray, target_column: str) ndarray[source]¶
Inverts Z-score normalization for a given target column.
Uses the ‘mean’ and ‘std’ stored in self.selected_columns_statistics to transform normalized values back to their original scale.
- Parameters:
values – A NumPy array of normalized values.
target_column – The name of the column whose statistics should be used for the inverse transformation.
- Returns:
A NumPy array of values in their original scale.
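The inverse transform is the usual one for Z-scores; a sketch:

    import numpy as np

    def invert_normalization_sketch(values: np.ndarray, mean: float, std: float) -> np.ndarray:
        # Z-score normalization is (x - mean) / std, so the inverse is:
        return values * std + mean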
- prepare_inference_batches(x: dict[str, ndarray], pad_to_batch_size: bool) list[dict[str, ndarray]][source]¶
Splits input data into batches for inference.
This function takes a large dictionary of feature arrays and splits them into a list of smaller dictionaries (batches) of size self.inference_batch_size.
- Parameters:
x – A dictionary of feature arrays.
pad_to_batch_size – If True (for ONNX), the last batch will be padded up to self.inference_batch_size by repeating samples. If False (for PyTorch), the last batch may be smaller.
- Returns:
A list of dictionaries, where each dictionary is a single batch ready for inference.
- sequifier.infer.expand_data_by_autoregression(data: DataFrame, autoregression_extra_steps: int, seq_length: int) DataFrame[source]¶
Expands a Polars DataFrame for autoregressive inference.
This function takes a DataFrame of sequences and adds autoregression_extra_steps new rows for each sequence. These new rows represent future time steps to be predicted.
For each new step, it:
1. Copies the last known observation for a sequence.
2. Increments the subsequenceId.
3. Shifts the historical data columns (e.g., ‘1’, ‘2’, …, ‘50’) one position “older” (e.g., old ‘1’ becomes new ‘2’, old ‘49’ becomes new ‘50’).
4. Fills the “newest” columns (e.g., new ‘1’ for the first extra step) with np.inf as a placeholder for the prediction.
- Parameters:
data – The input Polars DataFrame, sorted by sequenceId and subsequenceId.
autoregression_extra_steps – The number of future time steps to add to each sequence.
seq_length – The sequence length, used to identify the historical data columns (named ‘1’ through seq_length).
- Returns:
A new Polars DataFrame containing all original rows plus the newly generated future rows with placeholders.
- sequifier.infer.fill_in_predictions_pl(data: DataFrame, preds: dict[str, ndarray], current_subsequence_id: int, sequence_ids_present: Series, seq_length: int) DataFrame[source]¶
Fills in predictions into the main Polars DataFrame using a robust, join-based approach that preserves the original DataFrame’s structure.
This function broadcasts predictions to all relevant future rows via a join, then uses conditional expressions to update only the specific placeholder cells (np.inf) that correspond to the correct future time step.
- Parameters:
data – The main DataFrame containing all sequences.
preds – A dictionary of new predictions, mapping target column names to NumPy arrays.
current_subsequence_id – The adjusted subsequence ID at which predictions were made.
sequence_ids_present – A Polars Series of the sequence IDs in the current batch.
seq_length – The length of the sequence.
- Returns:
An updated Polars DataFrame with the same dimensions as the input, with future placeholder values filled in.
- sequifier.infer.fill_number(number: int | float, max_length: int) str[source]¶
Pads a number with leading zeros to a specified string length.
Used for creating sortable string keys (e.g., “001-001”, “001-002”).
- Parameters:
number – The integer or float to format.
max_length – The total desired length of the output string.
- Returns:
A string representation of the number, padded with leading zeros.
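For example, building the kind of sortable key mentioned above:

    key = f"{fill_number(1, 3)}-{fill_number(2, 3)}"
    # -> "001-002", which sorts lexicographically in numeric order.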
- sequifier.infer.format_delta(time_delta: timedelta) str[source]¶
Formats a timedelta object into a human-readable string (seconds).
- Parameters:
time_delta – The timedelta object to format.
- Returns:
A string representing the total seconds with 3 decimal places.
- sequifier.infer.get_embeddings(config: Any, inferer: Inferer, data: DataFrame, column_types: dict[str, dtype]) ndarray[source]¶
Generates embeddings from a Polars DataFrame.
This function converts a Polars DataFrame into the NumPy array dictionary format expected by the Inferer. It uses numpy_to_pytorch for the main conversion, then transforms the tensors to NumPy arrays before passing them to inferer.infer_embedding.
- Parameters:
config – The InfererModel configuration object.
inferer – The initialized Inferer instance.
data – The input Polars DataFrame chunk.
column_types – A dictionary mapping column names to torch.dtype.
- Returns:
A NumPy array containing the computed embeddings for the batch.
- sequifier.infer.get_embeddings_pt(config: Any, inferer: Inferer, data: dict[str, Tensor]) ndarray[source]¶
Generates embeddings from a batch of PyTorch tensor data.
This function serves as a wrapper for Inferer.infer_embedding when the input data is already in PyTorch tensor format (from loading .pt files which contain sequences, targets, sequence_ids, subsequence_ids, and start_positions). It converts the tensor dictionary to a NumPy array dictionary before passing it to the inferer.
- Parameters:
config – The InfererModel configuration object (unused, but kept for consistent function signature).
inferer – The initialized Inferer instance.
data – A dictionary mapping column/feature names to torch.Tensor objects (the sequences part loaded from the .pt file).
- Returns:
A NumPy array containing the computed embeddings for the batch.
- sequifier.infer.get_probs_preds(config: Any, inferer: Inferer, data: DataFrame, column_types: dict[str, dtype]) tuple[dict[str, ndarray] | None, dict[str, ndarray]][source]¶
Generates predictions from a Polars DataFrame (non-autoregressive).
This function converts a Polars DataFrame into the NumPy array dictionary format expected by the Inferer. It’s used for standard, non-autoregressive generative inference. It calls inferer.infer_generative once and returns the probabilities (if requested) and predictions.
- Parameters:
config – The InfererModel configuration object.
inferer – The initialized Inferer instance.
data – The input Polars DataFrame chunk.
column_types – A dictionary mapping column names to torch.dtype.
- Returns:
probs: A dictionary mapping target columns to NumPy arrays of probabilities, or None if config.output_probabilities is False.
preds: A dictionary mapping target columns to NumPy arrays of final predictions.
- Return type:
A tuple (probs, preds)
- sequifier.infer.get_probs_preds_autoregression(config: Any, inferer: Inferer, data: DataFrame, column_types: dict[str, dtype], seq_length: int) tuple[dict[str, ndarray] | None, dict[str, ndarray], ndarray][source]¶
Performs autoregressive inference using a time-step-based Polars loop.
This function orchestrates the autoregressive process by iterating through each unique, adjusted time step (subsequenceIdAdjusted).
For each time step:
1. Filters the main DataFrame data to get the current slice of data for all sequences at that time step.
2. Calls get_probs_preds to generate predictions for this slice.
3. Uses fill_in_predictions_pl to update the main data DataFrame, filling in the np.inf placeholders for the next time steps using the predictions just made.
4. Collects the predictions and a corresponding sort key.
After iterating through all time steps, it sorts all collected predictions based on the keys (sequenceId, subsequenceId) and returns the complete, ordered results.
- Parameters:
config – The InfererModel configuration object.
inferer – The initialized Inferer instance.
data – The input Polars DataFrame, expanded with future rows (see expand_data_by_autoregression).
column_types – A dictionary mapping column names to torch.dtype.
seq_length – The sequence length, passed to fill_in_predictions_pl.
- Returns:
probs: A dictionary mapping target columns to sorted NumPy arrays of probabilities, or None.
preds: A dictionary mapping target columns to sorted NumPy arrays of final predictions.
sequence_ids: A NumPy array of sequenceIds corresponding to each row in the preds arrays.
- Return type:
A tuple (probs, preds, sequence_ids)
- sequifier.infer.get_probs_preds_pt(config: Any, inferer: Inferer, data: dict[str, Tensor], extra_steps: int = 0) tuple[dict[str, ndarray] | None, dict[str, ndarray]][source]¶
Generates predictions from PyTorch tensor data, supporting autoregression.
This function performs generative inference on a batch of PyTorch tensor data loaded from .pt files (which contain sequences, targets, sequence_ids, subsequence_ids, and start_positions). It implements an autoregressive loop:
1. Runs inference on the initial data X (sequences).
2. For each subsequent step i in extra_steps, creates the next input X_next by shifting the previous input X and appending the prediction from the last step, then runs inference on X_next.
3. Collects and reshapes all predictions and probabilities from all steps into a single flat batch, ordered by original sample index, then by step.
- Parameters:
config – The InfererModel configuration object, used to check output_probabilities and selected_columns.
inferer – The initialized Inferer instance.
data – A dictionary mapping column/feature names to torch.Tensor objects (the sequences part loaded from the .pt file).
extra_steps – The number of additional autoregressive steps to perform. A value of 0 means simple, non-autoregressive inference.
- Returns:
probs: A dictionary mapping target columns to NumPy arrays of probabilities, ordered by sample index then step, or None if config.output_probabilities is False.
preds: A dictionary mapping target columns to NumPy arrays of final predictions, ordered by sample index then step.
- Return type:
A tuple (probs, preds)
- sequifier.infer.infer(args: Any, args_config: dict[str, Any]) None[source]¶
Runs the main inference pipeline.
This function orchestrates the inference process. It loads the main inference configuration, retrieves necessary metadata like ID maps and column statistics from a ddconfig file (if required for mapping or normalization), and then delegates the core work to the infer_worker function.
- Parameters:
args – Command-line arguments, typically from argparse. Expected to have attributes like config_path and on_unprocessed.
args_config – A dictionary of configuration overrides, often passed from the command line, that will be merged into the loaded configuration file.
- sequifier.infer.infer_embedding(config: InfererModel, inferer: Inferer, model_id: str, dataset: list[Any] | Iterator[Any], column_types: dict[str, dtype]) None[source]¶
Performs inference with an embedding model and saves the results.
This function iterates through the provided dataset (which can be a list of DataFrames or an iterator of tensors). For each data chunk, it calls the appropriate function (get_embeddings or get_embeddings_pt) to generate embeddings. It then formats these embeddings into a Polars DataFrame, associating them with their sequenceId and subsequenceId, and writes the resulting DataFrame to the configured output path.
- Parameters:
config – The InfererModel configuration object.
inferer – The initialized Inferer instance.
model_id – A string identifier for the model, used for naming output files.
dataset – A list containing a Polars DataFrame (for parquet/csv) or an iterator of loaded PyTorch data (for .pt files).
column_types – A dictionary mapping column names to their torch.dtype.
- sequifier.infer.infer_generative(config: InfererModel, inferer: Inferer, model_id: str, dataset: list[Any] | Iterator[Any], column_types: dict[str, dtype])[source]¶
Performs inference with a generative model and saves the results.
This function manages the generative inference workflow:
1. Iterates through the dataset (chunks).
2. Handles data preparation, including expanding data for autoregression if configured (expand_data_by_autoregression), and calculates the corresponding itemPosition for each prediction.
3. Calls the correct function to get probabilities and predictions based on data format and autoregression settings (e.g., get_probs_preds_autoregression, get_probs_preds_pt).
4. Post-processes predictions: maps integer predictions back to original IDs if map_to_id is True, and inverts normalization for real-valued target columns.
5. Saves probabilities to disk (if config.output_probabilities is True).
6. Saves the final predictions to disk, formatted as a Polars DataFrame with sequenceId, itemPosition, and target columns.
- Parameters:
config – The InfererModel configuration object.
inferer – The initialized Inferer instance.
model_id – A string identifier for the model, used for naming output files.
dataset – A list containing a Polars DataFrame (for parquet/csv) or an iterator of loaded PyTorch data (for .pt files).
column_types – A dictionary mapping column names to their torch.dtype.
- sequifier.infer.infer_worker(config: Any, args_config: dict[str, Any], id_maps: dict[str, dict[str | int, int]] | None, selected_columns_statistics: dict[str, dict[str, float]], percentage_limits: tuple[float, float] | None)[source]¶
Core worker function that performs inference.
This function handles the main workflow:
1. Loads the dataset based on config.read_format (parquet, csv, or pt).
2. Iterates over one or more model paths specified in the config.
3. For each model, initializes an Inferer object with all necessary configurations, mappings, and statistics.
4. Calls the appropriate inference function (infer_generative or infer_embedding) based on config.model_type.
5. Manages the data iterators and passes data chunks to the inference functions.
- Parameters:
config – The fully resolved InfererModel configuration object.
args_config – A dictionary of command-line arguments, passed to the Inferer for potential model loading overrides.
id_maps – A nested dictionary mapping categorical column names to their value-to-index maps. None if map_to_id is False.
selected_columns_statistics – A nested dictionary containing ‘mean’ and ‘std’ for real-valued columns used for normalization.
percentage_limits – A tuple (start_pct, end_pct) used only when config.read_format == “pt” to slice the dataset.
- sequifier.infer.load_pt_dataset(data_path: str, start_pct: float, end_pct: float) Iterator[source]¶
Lazily loads and yields data from .pt files in a directory.
This function scans a directory for .pt files, sorts them, and then yields the contents of a specific slice of those files defined by a start and end percentage. This allows for processing large datasets in chunks without loading everything into memory.
- Parameters:
data_path – The path to the folder containing the .pt files.
start_pct – The starting percentage (0.0 to 100.0) of the file list to begin loading from.
end_pct – The ending percentage (0.0 to 100.0) of the file list to stop loading at.
- Yields:
Iterator – An iterator where each item is the data loaded from a single .pt file (e.g., using torch.load).
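A brief usage sketch with a hypothetical directory; per the dataset classes documented below, each loaded file unpacks to a (sequences, targets, sequence_ids, subsequence_ids, start_positions) tuple:

    # Stream the first half of the .pt files without loading everything.
    for chunk in load_pt_dataset("project/data/train_pt", start_pct=0.0, end_pct=50.0):
        sequences, targets, *_ = chunk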
- sequifier.infer.normalize(outs: dict[str, ndarray]) dict[str, ndarray][source]¶
Applies the softmax function to a dictionary of logits.
Converts raw model logits for categorical columns into probabilities that sum to 1.
- Parameters:
outs – A dictionary mapping target column names to NumPy arrays of logits.
- Returns:
A dictionary mapping the same target column names to NumPy arrays of probabilities.
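A minimal sketch of the row-wise, numerically stable softmax this describes; the module’s actual implementation may differ in detail:

    import numpy as np

    def softmax_rows(logits: np.ndarray) -> np.ndarray:
        # Subtract the row max before exponentiating to avoid overflow.
        shifted = logits - logits.max(axis=-1, keepdims=True)
        exp = np.exp(shifted)
        return exp / exp.sum(axis=-1, keepdims=True)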
- sequifier.infer.sample_with_cumsum(probs: ndarray) ndarray[source]¶
Samples from a probability distribution using the inverse CDF method.
Takes an array of logits, computes the cumulative probability distribution for each row, draws one random number r from [0, 1) per row, and returns the index of the first class i where cumsum[i] > r.
- Parameters:
probs – A 2D NumPy array of logits (not normalized probabilities). Shape is (batch_size, num_classes).
- Returns:
A 1D NumPy array of shape (batch_size,) containing the sampled class indices.
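A sketch of row-wise inverse-CDF sampling, assuming rows of non-negative weights (the documented function additionally starts from logits); not the module’s actual implementation:

    import numpy as np

    def sample_rows(probs: np.ndarray) -> np.ndarray:
        # Cumulative sums per row, one uniform draw scaled to the row total,
        # then the first index whose cumulative sum exceeds the draw.
        cumsum = np.cumsum(probs, axis=1)
        r = np.random.rand(probs.shape[0], 1) * cumsum[:, -1:]
        return (cumsum > r).argmax(axis=1)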
- sequifier.infer.verify_variable_order(data: DataFrame) None[source]¶
Verifies that the DataFrame is correctly sorted for autoregression.
Checks two conditions:
1. sequenceId is globally sorted in ascending order.
2. subsequenceId is sorted in ascending order within each sequenceId group.
- Parameters:
data – The Polars DataFrame to check.
- Raises:
AssertionError – If sequenceId is not globally sorted or if subsequenceId is not sorted within sequenceId groups.
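The two conditions can be expressed in Polars roughly as follows; this is an equivalent sketch, not the function’s actual code:

    import polars as pl

    def check_order(data: pl.DataFrame) -> None:
        # Condition 1: sequenceId is globally sorted.
        assert data["sequenceId"].is_sorted()
        # Condition 2: subsequenceId is non-decreasing within each group.
        per_group = data.group_by("sequenceId", maintain_order=True).agg(
            (pl.col("subsequenceId").diff().fill_null(0) >= 0).all().alias("ok")
        )
        assert per_group["ok"].all()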
- sequifier.make.make(args)[source]¶
Creates a new sequifier project.
- Parameters:
args – The command-line arguments.
- class sequifier.hyperparameter_search.HyperparameterSearcher(hyperparameter_search_config)[source]¶
A class for performing hyperparameter search.
Manages the hyperparameter search process based on a given configuration. This class handles sampling hyperparameters, creating training configurations, launching training subprocesses, and logging results.
- sequifier.hyperparameter_search.hyperparameter_search(config_path, on_unprocessed) None[source]¶
Main function for initiating a hyperparameter search process.
This function loads the hyperparameter search configuration, initializes the searcher, and starts the search.
- Parameters:
config_path (str) – Path to the hyperparameter search YAML configuration file.
on_unprocessed (bool) – Flag indicating whether to run the search on unprocessed data.
- Returns:
None
- class sequifier.helpers.LogFile(path: str, open_mode: str, rank: int | None = None)[source]¶
Manages logging to multiple files based on verbosity levels.
This class opens multiple log files based on a path template and a hardcoded list of levels (2 and 3). Messages are written to files based on their assigned level, and high-level messages are also printed to the console on the main process (rank 0).
- rank¶
The rank of the current process, used to control console output.
- Type:
Optional[int]
- levels¶
The hardcoded list of log levels [2, 3] for which files are created.
- Type:
list[int]
- _files¶
A dictionary mapping log levels to their open file handlers.
- Type:
dict[int, io.TextIOWrapper]
- _path¶
The original path template provided.
- Type:
str
- __init__(path: str, open_mode: str, rank: int | None = None)[source]¶
Initializes the LogFile and opens log files.
The path argument should be a template containing “[NUMBER]”, which will be replaced by the log levels (2 and 3) to create separate log files.
- Parameters:
path – The path template for the log files (e.g., “run_log_[NUMBER].txt”).
open_mode – The mode for opening the log files (e.g., “a”, “w”).
rank – The rank of the current process (e.g., in distributed training). If None or 0, high-level messages will be printed to stdout.
- write(string: str, level: int = 3) None[source]¶
Writes a string to log files and potentially the console.
The string is written to all log files whose level is less than or equal to the specified level.
- A message with level=2 goes to file 2.
- A message with level=3 goes to files 2 and 3.
- If level is 3 or greater, the message is also printed to stdout when self.rank is None or 0.
- Parameters:
string – The message to log.
level – The verbosity level of the message. Defaults to 3.
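A short usage sketch with a hypothetical path template:

    log = LogFile("logs/run_[NUMBER].txt", open_mode="w", rank=0)
    log.write("verbose detail", level=2)   # written to logs/run_2.txt only
    log.write("epoch finished", level=3)   # written to both files and printed to stdout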
- sequifier.helpers.construct_index_maps(id_maps: dict[str, dict[str | int, int]] | None, target_columns_index_map: list[str], map_to_id: bool | None) dict[str, dict[int, str | int]][source]¶
Constructs reverse index maps (int index to original ID).
This function creates reverse mappings from the integer indices back to the original string or integer identifiers. It only performs this operation if map_to_id is True and id_maps is provided.
A special mapping for index 0 is added:
- If original IDs are strings, 0 maps to “unknown”.
- If original IDs are integers, 0 maps to (minimum original ID) - 1.
- Parameters:
id_maps – A nested dictionary mapping column names to their respective ID-to-index maps (e.g., {‘col_name’: {‘original_id’: 1, …}}). Expected to be provided if map_to_id is True.
target_columns_index_map – A list of column names for which to construct the reverse maps.
map_to_id – A boolean flag. If True, the reverse maps are constructed. If False or None, an empty dictionary is returned.
- Returns:
A dictionary where keys are column names from target_columns_index_map and values are the reverse maps (index-to-original-ID). Returns an empty dict if map_to_id is not True.
- Raises:
AssertionError – If map_to_id is True but id_maps is None.
AssertionError – If the values of a map are not consistently string or integer (excluding the added ‘0’ key).
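An illustrative call with hypothetical string IDs, showing the expected shape of the result per the description above:

    id_maps = {"itemId": {"A": 1, "B": 2}}
    index_maps = construct_index_maps(id_maps, ["itemId"], map_to_id=True)
    # Expected: {"itemId": {1: "A", 2: "B", 0: "unknown"}}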
- sequifier.helpers.normalize_path(path: str, project_path: str) str[source]¶
Normalizes a path to be relative to a project path, then joins them.
This function ensures that a given path is correctly expressed as an absolute path rooted at project_path. It does this by first removing the project_path prefix from path (if it exists) and then joining the result back to project_path.
This is useful for handling paths that might be provided as either relative (e.g., “data/file.txt”) or absolute (e.g., “/abs/path/to/project/data/file.txt”).
- Parameters:
path – The path to normalize.
project_path – The absolute path to the project’s root directory.
- Returns:
A normalized, absolute path.
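Both input forms from the example above resolve to the same result:

    normalize_path("data/file.txt", "/abs/path/to/project")
    normalize_path("/abs/path/to/project/data/file.txt", "/abs/path/to/project")
    # Both return "/abs/path/to/project/data/file.txt".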
- sequifier.helpers.numpy_to_pytorch(data: DataFrame, column_types: dict[str, dtype], all_columns: list[str], seq_length: int) dict[str, Tensor][source]¶
Converts a long-format Polars DataFrame to a dict of sequence tensors.
This function assumes the input DataFrame data is in a long format where each row represents a sequence for a specific feature. It expects a column named “inputCol” that contains the feature name (e.g., ‘price’, ‘volume’) and other columns representing time steps (e.g., “0”, “1”, …, “L”).
It generates two tensors for each column in all_columns:
1. An “input” tensor (from time steps L down to 1).
2. A “target” tensor (from time steps L-1 down to 0).
Example
For seq_length = 3 and all_columns = [‘price’], it will create:
- ‘price’: tensor from columns [“3”, “2”, “1”]
- ‘price_target’: tensor from columns [“2”, “1”, “0”]
- Parameters:
data – The long-format Polars DataFrame. Must contain “inputCol” and columns named as strings of integers for time steps.
column_types – A dictionary mapping feature names (from “inputCol”) to their desired torch.dtype.
all_columns – A list of all feature names (from “inputCol”) to be processed and converted into tensors.
seq_length – The total sequence length (L). This determines the column names for time steps (e.g., “0” to “L”).
- Returns:
A dictionary mapping feature names to their corresponding PyTorch tensors. Target tensors are stored with a _target suffix (e.g., {‘price’: <tensor>, ‘price_target’: <tensor>}).
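A toy call matching the example above; the single-row frame is illustrative and assumes the minimal long-format schema described:

    import polars as pl
    import torch

    df = pl.DataFrame({
        "inputCol": ["price"],
        "3": [1.0], "2": [2.0], "1": [3.0], "0": [4.0],
    })
    tensors = numpy_to_pytorch(df, {"price": torch.float32}, ["price"], seq_length=3)
    # tensors["price"] is built from columns ["3", "2", "1"],
    # tensors["price_target"] from columns ["2", "1", "0"].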
- sequifier.helpers.read_data(path: str, read_format: str, columns: list[str] | None = None) DataFrame[source]¶
Reads data from a CSV or Parquet file into a Polars DataFrame.
- Parameters:
path – The file path to read from.
read_format – The format of the file. Supported formats are “csv” and “parquet”.
columns – An optional list of column names to read. This argument is only used when read_format is “parquet”.
- Returns:
A Polars DataFrame containing the data from the file.
- Raises:
ValueError – If read_format is not “csv” or “parquet”.
- sequifier.helpers.subset_to_selected_columns(data: DataFrame | LazyFrame, selected_columns: list[str]) DataFrame | LazyFrame[source]¶
Filters a DataFrame to rows where ‘inputCol’ is in a selected list.
This function supports both Polars (DataFrame, LazyFrame) and Pandas DataFrames, dispatching to the appropriate filtering method.
- For Polars objects, it uses data.filter(pl.col(“inputCol”).is_in(…)).
- For other objects (presumably Pandas), it builds a NumPy boolean mask and filters using data.loc[…].
Note: The type hint only specifies Polars objects, but the implementation includes a fallback path for Pandas-like objects.
- Parameters:
data – The Polars (DataFrame, LazyFrame) or Pandas DataFrame to filter. It must contain a column named “inputCol”.
selected_columns – A list of values. Rows will be kept if their value in “inputCol” is present in this list.
- Returns:
A filtered DataFrame or LazyFrame of the same type as the input.
- sequifier.helpers.write_data(data: DataFrame, path: str, write_format: str, **kwargs) None[source]¶
Writes a Polars (or Pandas) DataFrame to a CSV or Parquet file.
This function detects the type of the input DataFrame:
- For Polars DataFrames, it uses .write_csv() or .write_parquet().
- For other DataFrame types (presumably Pandas), it uses .to_csv() or .to_parquet().
Note: The type hint specifies pl.DataFrame, but the implementation includes a fallback path that suggests compatibility with Pandas DataFrames.
- Parameters:
data – The Polars (or Pandas) DataFrame to write.
path – The destination file path.
write_format – The format to write. Supported formats are “csv” and “parquet”.
**kwargs – Additional keyword arguments passed to the underlying write function (e.g., write_csv for Polars, to_csv for Pandas).
- Returns:
None.
- Raises:
ValueError – If write_format is not “csv” or “parquet”.
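A round-trip sketch with hypothetical paths:

    df = read_data("data/input.parquet", "parquet", columns=["sequenceId", "inputCol"])
    write_data(df, "data/output.csv", "csv")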
- class sequifier.io.yaml.TrainModelDumper(stream, default_style=None, default_flow_style=False, canonical=None, indent=None, width=None, allow_unicode=None, line_break=None, encoding=None, explicit_start=None, explicit_end=None, version=None, tags=None, sort_keys=True)[source]¶
A custom YAML dumper for TrainModel objects.
This dumper extends the base yaml.Dumper to provide custom serialization for TrainModel and related objects, ensuring a clean and readable YAML output. It also modifies the indentation behavior for better formatting.
- increase_indent(flow=False, indentless=False)[source]¶
Increase the indentation level for the YAML output.
This method overrides the default behavior to force indentation for all block-style collections, improving the readability of the output YAML.
- Parameters:
flow – Whether the context is a flow-style collection.
indentless – Whether the context is an indentless sequence.
- Returns:
The result of the parent class’s increase_indent method, with flow forced to False.
- sequifier.io.yaml.represent_dot_dict(dumper, data)[source]¶
Represents DotDict objects as a simple YAML mapping. Without this representer, the default output exposes a ‘dictitems’ attribute; since DotDict is essentially a dictionary, representing it as a plain mapping produces clean output.
- sequifier.io.yaml.represent_numpy_float(dumper, data)[source]¶
Represents numpy.float64 (and similar numpy floats) as standard YAML floats.
- sequifier.io.yaml.represent_numpy_int(dumper, data)[source]¶
Represents numpy.int64 (and similar numpy integers) as standard YAML integers.
- sequifier.io.yaml.represent_sequifier_object(dumper, data)[source]¶
Represents objects from ‘sequifier.config.train_config’ (such as TrainModel, ModelSpecModel, and TrainingSpecModel) as a simple YAML mapping, using the object’s __dict__. This removes the !!python/object tag and the explicit ‘__dict__:’ and ‘__fields_set__:’ keys from the output.
- class sequifier.io.sequifier_dataset_from_folder.SequifierDatasetFromFolder(data_path: str, config: TrainModel)[source]¶
An efficient PyTorch Dataset that pre-loads all data into RAM.
This is the ideal strategy when the entire dataset split can fit into the system’s memory. It pays a one-time I/O cost at initialization, after which all data access during training is extremely fast (RAM access).
- __getitem__(idx: int) Tuple[Dict[str, Tensor], Dict[str, Tensor], int, int, int][source]¶
Retrieves a single sample from the pre-loaded data.
- Parameters:
idx – The index of the sample to retrieve.
- Returns:
sequence (dict): Dictionary of feature tensors for the sample.
targets (dict): Dictionary of target tensors for the sample.
sequence_id (int): The sequence ID of the sample.
subsequence_id (int): The subsequence ID within the sequence.
start_position (int): The starting item position of the subsequence within the original full sequence.
- Return type:
A tuple containing
- __init__(data_path: str, config: TrainModel)[source]¶
Initializes the dataset by loading all .pt files from the data directory into memory. Each .pt file is expected to contain a tuple: (sequences_dict, targets_dict, sequence_ids_tensor, subsequence_ids_tensor, start_item_positions_tensor).
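A brief usage sketch with a hypothetical data directory and an existing TrainModel config:

    dataset = SequifierDatasetFromFolder("project/data/train_pt", config)
    sequence, targets, sequence_id, subsequence_id, start_position = dataset[0]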
- class sequifier.io.sequifier_dataset_from_folder_lazy.SequifierDatasetFromFolderLazy(data_path: str, config: TrainModel, ram_threshold: float = 70.0)[source]¶
An efficient PyTorch Dataset for datasets that do not fit into RAM.
This class loads data from individual .pt files on-demand (lazily) when an item is requested via __getitem__. It maintains an in-memory cache of recently used files to speed up access. To prevent memory exhaustion, the cache is managed by a Least Recently Used (LRU) policy, which evicts the oldest data chunks when the total system RAM usage exceeds a configurable threshold.
This strategy balances I/O overhead and memory usage, making it suitable for training on datasets larger than the available system memory.
- __getitem__(idx: int) Tuple[Dict[str, Tensor], Dict[str, Tensor], int, int, int][source]¶
Retrieves a single data sample, loading from disk if not in the cache.
This method is the core of the lazy-loading strategy. It is thread-safe and manages the cache automatically.
- Parameters:
idx – The index of the sample to retrieve.
- Returns:
sequence (dict): Dictionary of feature tensors for the sample.
targets (dict): Dictionary of target tensors for the sample.
sequence_id (int): The sequence ID of the sample.
subsequence_id (int): The subsequence ID within the sequence.
start_position (int): The starting item position of the subsequence within the original full sequence.
- Return type:
A tuple containing
- __init__(data_path: str, config: TrainModel, ram_threshold: float = 70.0)[source]¶
Initializes the dataset by reading metadata and setting up the cache. Each .pt file is expected to contain a tuple: (sequences_dict, targets_dict, sequence_ids_tensor, subsequence_ids_tensor, start_item_positions_tensor).
- Parameters:
data_path (str) – The path to the directory containing the pre-processed .pt files and a metadata.json file.
config (TrainModel) – The training configuration object.
ram_threshold (float) – The system RAM usage percentage (0-100) at which to trigger cache eviction.
- class sequifier.io.sequifier_dataset_from_file.SequifierDatasetFromFile(data_path: str, config: TrainModel, shuffle: bool = True)[source]¶
An iterable-style dataset that pre-loads all data into CPU RAM and yields pre-collated batches.
This is the idiomatic PyTorch solution for implementing custom ‘en bloc’ batching. The __iter__ method handles shuffling and batch slicing, ensuring maximum performance.
- __iter__() Iterator[Tuple[Dict[str, Tensor], Dict[str, Tensor], None, None, None]][source]¶
Yields batches of data.
Handles shuffling (if enabled) and slicing data based on distributed rank and worker ID.
- Yields:
Iterator[Tuple[Dict[str, torch.Tensor], Dict[str, torch.Tensor], None, None, None]] –
An iterator where each item is a tuple containing:
data_batch (dict): Dictionary of feature tensors for the batch.
targets_batch (dict): Dictionary of target tensors for the batch.
None: Placeholder for sequence_id (not used in this dataset type).
None: Placeholder for subsequence_id (not used in this dataset type).
None: Placeholder for start_position (not used in this dataset type).
- sequifier.optimizers.optimizers.get_optimizer_class(optimizer_name: str) Optimizer[source]¶
Gets the optimizer class from a string.
- Parameters:
optimizer_name – The name of the optimizer.
- Returns:
The optimizer class.
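A usage sketch, assuming “Adam” is among the supported optimizer names:

    import torch

    model = torch.nn.Linear(4, 2)  # stand-in model for illustration
    optimizer_class = get_optimizer_class("Adam")
    optimizer = optimizer_class(model.parameters(), lr=1e-3)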
- class sequifier.samplers.distributed_grouped_random_sampler.DistributedGroupedRandomSampler(data_source: SequifierDatasetFromFolder | SequifierDatasetFromFolderLazy, num_replicas: int, rank: int, shuffle: bool = True, seed: int = 0)[source]¶
A distributed sampler that groups samples by file to improve cache efficiency.
This sampler partitions the set of data FILES across the distributed processes, not the individual samples. Each process then iterates through its assigned files in a random order. Within each file, the samples are also shuffled.
This ensures that each process sees a unique subset of the data per epoch while maximizing sequential reads from the same file, which is ideal for lazy-loading datasets.
- __init__(data_source: SequifierDatasetFromFolder | SequifierDatasetFromFolderLazy, num_replicas: int, rank: int, shuffle: bool = True, seed: int = 0)[source]¶
- Parameters:
data_source – The dataset to sample from. Must have a batch_files_info attribute.
num_replicas – Number of processes participating in distributed training.
rank – Rank of the current process.
shuffle – If True, shuffles the order of files and samples within files.
seed – Random seed used to create the permutation.
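A usage sketch for a hypothetical two-process setup; the dataset must be one of the folder-based datasets above, which expose batch_files_info:

    from torch.utils.data import DataLoader

    sampler = DistributedGroupedRandomSampler(
        dataset, num_replicas=2, rank=0, shuffle=True, seed=0
    )
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)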