This page contains the auto-generated API reference documentation.
Preprocessing Config¶
- class sequifier.config.preprocess_config.PreprocessorModel(*, project_root: str, data_path: str, read_format: str = 'csv', write_format: str = 'parquet', merge_output: bool = True, selected_columns: list[str] | None = None, split_ratios: list[float], seq_length: int, stride_by_split: list[int] | None = None, max_rows: int | None = None, seed: int, n_cores: int | None = None, batches_per_file: int = 1024, process_by_file: bool = True, continue_preprocessing: bool = False, subsequence_start_mode: str = 'distribute', use_precomputed_maps: list[str] | None = None, metadata_config_path: str | None = None)[source]¶
Pydantic model for preprocessor configuration.
- project_root¶
The path to the sequifier project directory.
- Type:
str
- data_path¶
The path to the input data file.
- Type:
str
- read_format¶
The file type of the input data. Can be ‘csv’ or ‘parquet’.
- Type:
str
- write_format¶
The file type for the preprocessed output data.
- Type:
str
- merge_output¶
If True, combines all preprocessed data into a single file.
- Type:
bool
- selected_columns¶
A list of columns to be included in the preprocessing. If None, all columns are used.
- Type:
list[str] | None
- split_ratios¶
A list of floats that define the relative sizes of data splits (e.g., for train, validation, test). The sum of proportions must be 1.0.
- Type:
list[float]
- seq_length¶
The sequence length for the model inputs.
- Type:
int
- stride_by_split¶
A list of step sizes for creating subsequences within each data split.
- Type:
list[int] | None
- max_rows¶
The maximum number of input rows to process. If None, all rows are processed.
- Type:
int | None
- seed¶
A random seed for reproducibility.
- Type:
int
- n_cores¶
The number of CPU cores to use for parallel processing. If None, it uses the available CPU cores.
- Type:
int | None
- batches_per_file¶
The number of batches to process per file.
- Type:
int
- process_by_file¶
A flag to indicate if processing should be done file by file.
- Type:
bool
- continue_preprocessing¶
If True, resumes a preprocessing job that was interrupted while writing to the temp folder.
- Type:
bool
- subsequence_start_mode¶
How subsequence start positions are chosen: “distribute” minimizes the maximum overlap between subsequences; the alternative is “exact”.
- Type:
str
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
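As a rough illustration of the constraints described above, the following sketch checks a plain dictionary of preprocessing options. The helper name `check_preprocess_options` and the rule that `stride_by_split` holds one entry per split are assumptions made for this example; sequifier performs its own validation inside PreprocessorModel, which may differ.

```python
import math


def check_preprocess_options(options):
    """Return a list of problems found in a dict of preprocessing options.

    Hypothetical helper for illustration only, not part of sequifier.
    """
    problems = []
    ratios = options["split_ratios"]
    # The documentation states that the split ratios must sum to 1.0.
    if not math.isclose(sum(ratios), 1.0, abs_tol=1e-9):
        problems.append(f"split_ratios sum to {sum(ratios)}, expected 1.0")
    # Assumption: one stride per split when stride_by_split is given.
    strides = options.get("stride_by_split")
    if strides is not None and len(strides) != len(ratios):
        problems.append("stride_by_split needs one entry per split")
    # The documentation lists 'csv' and 'parquet' as input formats.
    if options.get("read_format", "csv") not in ("csv", "parquet"):
        problems.append("read_format must be 'csv' or 'parquet'")
    return problems


example = {
    "split_ratios": [0.8, 0.1, 0.1],
    "stride_by_split": [1, 1, 1],
    "read_format": "csv",
}
print(check_preprocess_options(example))
```

An empty list means that none of the three checks found a problem.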
Training Config¶
- class sequifier.config.train_config.ModelSpecModel(*, initial_embedding_dim: int, feature_embedding_dims: dict[str, int] | None = None, joint_embedding_dim: int | None = None, dim_model: int, n_head: int, dim_feedforward: int, num_layers: int, activation_fn: str = 'swiglu', normalization: str = 'rmsnorm', positional_encoding: str = 'learned', attention_type: str = 'mha', norm_first: bool = True, n_kv_heads: int | None = None, rope_theta: float = 10000.0, prediction_length: int)[source]¶
Pydantic model for model specifications.
- initial_embedding_dim¶
The size of the input embedding. Must be equal to dim_model if joint_embedding_dim is None.
- Type:
int
- feature_embedding_dims¶
The embedding dimensions for each input column. Must sum to initial_embedding_dim.
- Type:
dict[str, int] | None
- joint_embedding_dim¶
Joint embedding layer after initial embedding. Must be equal to dim_model if specified.
- Type:
int | None
- n_head¶
The number of heads in the multi-head attention models.
- Type:
int
- dim_feedforward¶
The dimension of the feedforward network model.
- Type:
int
- num_layers¶
The number of layers in the transformer model.
- Type:
int
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
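The dimension constraints stated for initial_embedding_dim, feature_embedding_dims and joint_embedding_dim can be summarized in a small check. This is a minimal sketch of those documented rules only; the function name `check_embedding_dims` and the column names in the usage example are hypothetical, and the actual validators in ModelSpecModel may cover more cases.

```python
def check_embedding_dims(
    initial_embedding_dim,
    dim_model,
    joint_embedding_dim=None,
    feature_embedding_dims=None,
):
    """Raise ValueError if the documented dimension rules are violated.

    Hypothetical helper for illustration only, not part of sequifier.
    """
    if joint_embedding_dim is None:
        # Without a joint embedding layer, the input embedding feeds the
        # transformer directly, so the sizes must match.
        if initial_embedding_dim != dim_model:
            raise ValueError("initial_embedding_dim must equal dim_model")
    elif joint_embedding_dim != dim_model:
        # With a joint embedding layer, that layer must output dim_model.
        raise ValueError("joint_embedding_dim must equal dim_model")
    if feature_embedding_dims is not None:
        # Per-column embedding sizes must add up to the initial embedding.
        if sum(feature_embedding_dims.values()) != initial_embedding_dim:
            raise ValueError(
                "feature_embedding_dims must sum to initial_embedding_dim"
            )


# Passes: 32 + 16 == 48, and the joint embedding projects to dim_model.
check_embedding_dims(
    initial_embedding_dim=48,
    dim_model=64,
    joint_embedding_dim=64,
    feature_embedding_dims={"item_id": 32, "price": 16},
)
```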
- class sequifier.config.train_config.TrainModel(*, project_root: str, metadata_config_path: str, model_name: str, training_data_path: str, validation_data_path: str, read_format: str = 'parquet', input_columns: list[str], column_types: dict[str, str], categorical_columns: list[str], real_columns: list[str], target_columns: list[str], target_column_types: dict[str, str], id_maps: dict[str, dict[str | int, int]], seq_length: int, n_classes: dict[str, int], inference_batch_size: int, seed: int, export_generative_model: bool, export_embedding_model: bool, export_onnx: bool = True, export_pt: bool = False, export_with_dropout: bool = False, model_spec: ModelSpecModel, training_spec: TrainingSpecModel)[source]¶
Pydantic model for training configuration.
- project_root¶
The path to the sequifier project directory.
- Type:
str
- metadata_config_path¶
The path to the data-driven configuration file.
- Type:
str
- model_name¶
The name of the model being trained.
- Type:
str
- training_data_path¶
The path to the training data.
- Type:
str
- validation_data_path¶
The path to the validation data.
- Type:
str
- read_format¶
The file format of the input data (e.g., ‘csv’, ‘parquet’).
- Type:
str
- input_columns¶
The list of input columns to be used for training.
- Type:
list[str]
- column_types¶
A dictionary mapping each column to its numeric type (‘int64’ or ‘float64’).
- Type:
dict[str, str]
- categorical_columns¶
A list of columns that are categorical.
- Type:
list[str]
- real_columns¶
A list of columns that are real-valued.
- Type:
list[str]
- target_columns¶
The list of target columns for model training.
- Type:
list[str]
- target_column_types¶
A dictionary mapping target columns to their types (‘categorical’ or ‘real’).
- Type:
dict[str, str]
- id_maps¶
For each categorical column, a map from distinct values to their indexed representation.
- Type:
dict[str, dict[str | int, int]]
- seq_length¶
The sequence length of the model’s input.
- Type:
int
- n_classes¶
The number of classes for each categorical column.
- Type:
dict[str, int]
- inference_batch_size¶
The batch size to be used for inference after model export.
- Type:
int
- seed¶
The random seed for numpy and PyTorch.
- Type:
int
- export_generative_model¶
If True, exports the generative model.
- Type:
bool
- export_embedding_model¶
If True, exports the embedding model.
- Type:
bool
- export_onnx¶
If True, exports the model in ONNX format.
- Type:
bool
- export_pt¶
If True, exports the model using torch.save.
- Type:
bool
- export_with_dropout¶
If True, exports the model with dropout enabled.
- Type:
bool
- model_spec¶
The specification of the transformer model architecture.
- training_spec¶
The specification of the training run configuration.
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- class sequifier.config.train_config.TrainingSpecModel(*, device: str, device_max_concat_length: int = 12, epochs: int, log_interval: int = 10, class_share_log_columns: list[str] = <factory>, early_stopping_epochs: int | None = None, save_interval_epochs: int, save_latest_interval_minutes: float | None = None, save_batch_interval_minutes: float | None = None, save_batch_interval_minutes_val_loss: bool = True, calculate_validation_loss_on_initialization: bool = True, batch_size: int, learning_rate: float, criterion: dict[str, str], class_weights: dict[str, list[float]] | None = None, accumulation_steps: int | None = None, dropout: float = 0.0, loss_weights: dict[str, float] | None = None, optimizer: ~sequifier.config.train_config.DotDict = <factory>, scheduler: ~sequifier.config.train_config.DotDict = <factory>, scheduler_step_on: str = 'epoch', continue_training: bool = True, enforce_determinism: bool = False, distributed: bool = False, load_full_data_to_ram: bool = True, max_ram_gb: int | float = 16, world_size: int = 1, num_workers: int = 0, backend: str = 'nccl', layer_type_dtypes: dict[str, str] | None = None, layer_autocast: bool | None = True, sampling_strategy: str = 'exact', data_parallelism: str | None = None, fsdp_cpu_offload: bool | None = None, torch_compile: str = 'outer', float32_matmul_precision: str = 'highest')[source]¶
Pydantic model for training specifications.
- device¶
The torch.device to train the model on (e.g., ‘cuda’, ‘cpu’, ‘mps’).
- Type:
str
- device_max_concat_length¶
Maximum sequence length for concatenation on device.
- Type:
int
- epochs¶
The total number of epochs to train for.
- Type:
int
- log_interval¶
The interval in batches for logging.
- Type:
int
- class_share_log_columns¶
A list of column names for which to log the class share of predictions.
- Type:
list[str]
- early_stopping_epochs¶
Number of epochs to wait for validation loss improvement before stopping.
- Type:
int | None
- save_interval_epochs¶
The interval in epochs for checkpointing the model.
- Type:
int
- save_latest_interval_minutes¶
The time interval, in minutes, at which a checkpoint is written to the “latest” checkpoint path.
- Type:
float | None
- save_batch_interval_minutes¶
The time interval, in minutes, at which a checkpoint is written to a unique checkpoint path.
- Type:
float | None
- save_batch_interval_minutes_val_loss¶
If True, calculates the validation loss whenever a batch-interval checkpoint is saved.
- Type:
bool
- calculate_validation_loss_on_initialization¶
If True, calculates the validation loss on the initial weights.
- Type:
bool
- batch_size¶
The training batch size.
- Type:
int
- learning_rate¶
The learning rate.
- Type:
float
- criterion¶
A dictionary mapping each target column to a loss function.
- Type:
dict[str, str]
- class_weights¶
A dictionary mapping categorical target columns to a list of class weights.
- Type:
dict[str, list[float]] | None
- accumulation_steps¶
The number of gradient accumulation steps.
- Type:
int | None
- dropout¶
The dropout value for the transformer model.
- Type:
float
- loss_weights¶
A dictionary mapping columns to specific loss weights.
- Type:
dict[str, float] | None
- optimizer¶
The optimizer configuration.
- Type:
sequifier.config.train_config.DotDict
- scheduler¶
The learning rate scheduler configuration.
- Type:
sequifier.config.train_config.DotDict
- scheduler_step_on¶
When .step() is called on the scheduler, either ‘epoch’ or ‘batch’.
- Type:
str
- continue_training¶
If True, continue training from the latest checkpoint.
- Type:
bool
- distributed¶
If True, enables distributed training.
- Type:
bool
- load_full_data_to_ram¶
If True, loads the entire dataset into RAM.
- Type:
bool
- world_size¶
The number of processes for distributed training.
- Type:
int
- num_workers¶
The number of worker threads for data loading.
- Type:
int
- backend¶
The distributed training backend (e.g., ‘nccl’).
- Type:
str
- layer_type_dtypes¶
Dictionary mapping layer types (linear, embedding, norm) to dtypes (bfloat16, float8_e4m3fn).
- Type:
dict[str, str] | None
- layer_autocast¶
Whether to use autocast.
- Type:
bool | None
- sampling_strategy¶
How data is equalized between GPUs in distributed training: ‘exact’, ‘oversampling’ or ‘undersampling’.
- Type:
str
- torch_compile¶
Whether to compile the entire model (‘outer’) or only the transformer layers (‘inner’) with torch.compile; ‘none’ disables compilation.
- Type:
str
- float32_matmul_precision¶
The precision level of float32 matrix multiplications. One of ‘highest’, ‘high’ and ‘medium’.
- Type:
str
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
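To make the two values of scheduler_step_on concrete, the sketch below counts how often a scheduler would be stepped under each setting. `CountingScheduler` and `run_epochs` are hypothetical stand-ins written for this example; the real training loop, and how it interacts with accumulation_steps, is not shown here and may differ.

```python
class CountingScheduler:
    """Stand-in for a learning rate scheduler; it only counts step() calls."""

    def __init__(self):
        self.steps = 0

    def step(self):
        self.steps += 1


def run_epochs(epochs, batches_per_epoch, scheduler_step_on):
    """Return how often the scheduler was stepped (illustrative loop only)."""
    scheduler = CountingScheduler()
    for _ in range(epochs):
        for _ in range(batches_per_epoch):
            # Forward pass, backward pass and optimizer step would go here.
            if scheduler_step_on == "batch":
                scheduler.step()
        if scheduler_step_on == "epoch":
            scheduler.step()
    return scheduler.steps


print(run_epochs(epochs=3, batches_per_epoch=5, scheduler_step_on="epoch"))
print(run_epochs(epochs=3, batches_per_epoch=5, scheduler_step_on="batch"))
```

With ‘batch’, a schedule defined over a number of steps has to be sized in batches rather than epochs.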
Inference Config¶
- class sequifier.config.infer_config.InfererModel(*, project_root: str, metadata_config_path: str, model_path: str | list[str], model_type: str, data_path: str, training_config_path: str = 'configs/train.yaml', read_format: str = 'parquet', write_format: str = 'csv', input_columns: list[str], categorical_columns: list[str], real_columns: list[str], target_columns: list[str], column_types: dict[str, str], target_column_types: dict[str, str], enforce_determinism: bool = False, output_probabilities: bool = False, map_to_id: bool = True, seed: int, device: str, seq_length: int, prediction_length: int = 1, inference_batch_size: int, sample_from_distribution_columns: list[str] | None = None, infer_with_dropout: bool = False, autoregression: bool = False, autoregression_total_steps: int | None = None)[source]¶
Pydantic model for inference configuration.
- project_root¶
The path to the sequifier project directory.
- Type:
str
- metadata_config_path¶
The path to the data-driven configuration file.
- Type:
str
- model_path¶
The path to the trained model file(s).
- Type:
str | list[str]
- model_type¶
The type of model, either ‘embedding’ or ‘generative’.
- Type:
str
- data_path¶
The path to the data to be used for inference.
- Type:
str
- training_config_path¶
The path to the training configuration file.
- Type:
str
- read_format¶
The file format of the input data (e.g., ‘csv’, ‘parquet’).
- Type:
str
- write_format¶
The file format for the inference output.
- Type:
str
- input_columns¶
The list of input columns used for inference.
- Type:
list[str]
- categorical_columns¶
A list of columns that are categorical.
- Type:
list[str]
- real_columns¶
A list of columns that are real-valued.
- Type:
list[str]
- target_columns¶
The list of target columns for inference.
- Type:
list[str]
- column_types¶
A dictionary mapping each column to its numeric type (‘int64’ or ‘float64’).
- Type:
dict[str, str]
- target_column_types¶
A dictionary mapping target columns to their types (‘categorical’ or ‘real’).
- Type:
dict[str, str]
- output_probabilities¶
If True, outputs the probability distributions for categorical target columns.
- Type:
bool
- map_to_id¶
If True, maps categorical output values back to their original IDs.
- Type:
bool
- seed¶
The random seed for reproducibility.
- Type:
int
- device¶
The device to run inference on (e.g., ‘cuda’, ‘cpu’, ‘mps’).
- Type:
str
- seq_length¶
The sequence length of the model’s input.
- Type:
int
- inference_batch_size¶
The batch size for inference.
- Type:
int
- sample_from_distribution_columns¶
A list of columns from which to sample from the distribution.
- Type:
list[str] | None
- infer_with_dropout¶
If True, applies dropout during inference.
- Type:
bool
- autoregression¶
If True, performs autoregressive inference.
- Type:
bool
- autoregression_total_steps¶
The number of total steps for autoregressive inference.
- Type:
int | None
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
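The autoregression and autoregression_total_steps options describe inference in which each prediction is fed back as input. The following is only a conceptual sketch of that idea for a single numeric column, assuming a sliding window of seq_length values; `autoregress` and the toy `predict_next` function are hypothetical and do not reflect sequifier's internal implementation.

```python
def autoregress(history, seq_length, total_steps, predict_next):
    """Generate total_steps predictions, feeding each one back as input.

    Hypothetical helper for illustration only, not part of sequifier.
    """
    # Start from the most recent seq_length observed values.
    window = list(history[-seq_length:])
    predictions = []
    for _ in range(total_steps):
        next_value = predict_next(window)
        predictions.append(next_value)
        # Slide the window: drop the oldest value, append the prediction.
        window = window[1:] + [next_value]
    return predictions


def toy_model(window):
    """Toy predictor that continues a counting sequence."""
    return window[-1] + 1


print(autoregress([1, 2, 3, 4], seq_length=3, total_steps=3, predict_next=toy_model))
```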
Hyperparameter Search Config¶
- class sequifier.config.hyperparameter_search_config.HyperparameterSearchConfig(*, project_root: str, metadata_config_path: str, hp_search_name: str, search_strategy: str = 'bayesian', n_samples: int | None = None, prune_trials: bool | None = True, model_config_write_path: str, training_data_path: str, validation_data_path: str, read_format: str = 'parquet', input_columns: list[list[str]], column_types: list[dict[str, str]], categorical_columns: list[list[str]], real_columns: list[list[str]], target_columns: list[str], target_column_types: dict[str, str], id_maps: dict[str, dict[str | int, int]], seq_length: list[int], n_classes: dict[str, int], inference_batch_size: int, export_generative_model: bool, export_embedding_model: bool, export_onnx: bool = True, export_pt: bool = False, export_with_dropout: bool = False, evaluation_inference_config: str | None = None, evaluation_script: str | None = None, evaluation_metric_directions: list[str] | None = None, evaluation_metrics: list[str] | None = None, model_hyperparameter_sampling: ModelSpecHyperparameterSampling, training_hyperparameter_sampling: TrainingSpecHyperparameterSampling, override_input: bool = False)[source]¶
Pydantic model for hyperparameter search configuration.
- project_root¶
The path to the sequifier project directory.
- Type:
str
- metadata_config_path¶
The path to the data-driven configuration file.
- Type:
str
- hp_search_name¶
The name for the hyperparameter search.
- Type:
str
- search_strategy¶
The search strategy, e.g. “bayesian” (the default), “sample” or “grid”.
- Type:
str
- n_samples¶
The number of samples to draw for the search.
- model_config_write_path¶
The path to write the model configurations to.
- Type:
str
- training_data_path¶
The path to the training data.
- Type:
str
- validation_data_path¶
The path to the validation data.
- Type:
str
- read_format¶
The file format of the input data.
- Type:
str
- input_columns¶
A list of lists of columns to be used for training.
- Type:
list[list[str]]
- column_types¶
A list of dictionaries mapping columns to their types.
- Type:
list[dict[str, str]]
- categorical_columns¶
A list of lists of categorical columns.
- Type:
list[list[str]]
- real_columns¶
A list of lists of real-valued columns.
- Type:
list[list[str]]
- target_columns¶
The list of target columns for model training.
- Type:
list[str]
- target_column_types¶
A dictionary mapping target columns to their types.
- Type:
dict[str, str]
- id_maps¶
A dictionary mapping categorical values to their indexed representation.
- Type:
dict[str, dict[str | int, int]]
- seq_length¶
A list of possible sequence lengths.
- Type:
list[int]
- n_classes¶
The number of classes for each categorical column.
- Type:
dict[str, int]
- inference_batch_size¶
The batch size for inference.
- Type:
int
- export_onnx¶
If True, exports the model in ONNX format.
- Type:
bool
- export_pt¶
If True, exports the model using torch.save.
- Type:
bool
- export_with_dropout¶
If True, exports the model with dropout enabled.
- Type:
bool
- model_hyperparameter_sampling¶
The sampling configuration for model hyperparameters.
- training_hyperparameter_sampling¶
The sampling configuration for training hyperparameters.
- evaluation_inference_config¶
The inference config used to run inference when evaluating trials during hyperparameter search.
- Type:
str | None
- evaluation_script¶
The script that outputs the evaluation metrics, typically computed from the inference output.
- Type:
str | None
- evaluation_metrics¶
The evaluation metrics to optimize during hyperparameter search.
- Type:
list[str] | None
- evaluation_metric_directions¶
The direction in which to optimize each of the evaluation_metrics. Only ‘minimize’ and ‘maximize’ are allowed.
- Type:
list[str] | None
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- sample_trial(trial: Any, run_index: int) TrainModel[source]¶
Generates a complete training configuration using an Optuna trial.
This method orchestrates the sampling of both model and training specifications, as well as data sequence parameters, combining them into a final configuration ready for model execution.
- Parameters:
trial (Any) – The Optuna trial object used for suggesting hyperparameters.
run_index (int) – The current run/trial index, used to assign a unique name to the model.
- Returns:
A fully populated configuration instance for the current trial.
- Return type:
TrainModel
- class sequifier.config.hyperparameter_search_config.ModelSpecHyperparameterSampling(*, initial_embedding_dim: list[int], joint_embedding_dim: list[Optional[int]], dim_model: list[int], feature_embedding_dims: list[dict[str, int]] | None, n_head: list[int], dim_feedforward: list[int] | IntDistribution, num_layers: list[int] | IntDistribution, prediction_length: int, activation_fn: list[str], normalization: list[str], positional_encoding: list[str], attention_type: list[str], norm_first: list[bool], n_kv_heads: list[Optional[int]], rope_theta: list[float] | FloatDistribution)[source]¶
Pydantic model for model specification hyperparameter sampling.
- initial_embedding_dim¶
A list of possible sizes for the initial input embedding.
- Type:
list[int]
- feature_embedding_dims¶
A list of possible dictionaries defining embedding dimensions for each input column.
- Type:
list[dict[str, int]] | None
- joint_embedding_dim¶
A list of possible sizes for the joint embedding layer projection.
- Type:
list[Optional[int]]
- dim_model¶
A list of possible numbers of expected features in the input (d_model).
- Type:
list[int]
- n_head¶
A list of possible numbers of heads in the multi-head attention models.
- Type:
list[int]
- dim_feedforward¶
A list of possible dimensions of the feedforward network model.
- Type:
list[int] | sequifier.config.hyperparameter_search_config.IntDistribution
- num_layers¶
A list of possible numbers of layers in the transformer model.
- Type:
list[int] | sequifier.config.hyperparameter_search_config.IntDistribution
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- sample_trial(trial: Any) ModelSpecModel[source]¶
Samples model architecture hyperparameters using an Optuna trial.
This method uses the Optuna trial to suggest structural parameters such as the number of layers, feedforward dimensions, and attention heads. It ensures that dependent dimensions (like n_head and dim_model) stay correctly paired and that invalid key-value head combinations are filtered out.
- Parameters:
trial (Any) – The Optuna trial object used for suggesting hyperparameters.
- Returns:
A populated model specification model with the sampled architecture parameters.
- Return type:
ModelSpecModel
- class sequifier.config.hyperparameter_search_config.TrainingSpecHyperparameterSampling(*, device: str, epochs: list[int], log_interval: int = 10, class_share_log_columns: list[str] = <factory>, early_stopping_epochs: int | None = None, save_interval_epochs: int, save_latest_interval_minutes: float | None = None, save_batch_interval_minutes: float | None = None, save_batch_interval_minutes_val_loss: bool = True, calculate_validation_loss_on_initialization: bool = False, batch_size: list[int] | ~sequifier.config.hyperparameter_search_config.IntDistribution, learning_rate: list[float], criterion: dict[str, str], class_weights: dict[str, list[float]] | None = None, accumulation_steps: list[int] | ~sequifier.config.hyperparameter_search_config.IntDistribution, dropout: list[float] | ~sequifier.config.hyperparameter_search_config.FloatDistribution = [0.0], loss_weights: dict[str, float] | None = None, optimizer: list[sequifier.config.train_config.DotDict] = <factory>, scheduler: list[sequifier.config.train_config.DotDict] = <factory>, continue_training: bool, scheduler_step_on: str = 'epoch', distributed: bool = False, load_full_data_to_ram: bool = True, max_ram_gb: int | float = 16, device_max_concat_length: int = 12, world_size: int = 1, num_workers: int = 0, backend: str = 'nccl', layer_type_dtypes: dict[str, str] | None = None, layer_autocast: bool | None = True, sampling_strategy: str = 'exact', data_parallelism: str | None = None, fsdp_cpu_offload: bool | None = None, torch_compile: str = 'outer', float32_matmul_precision: str = 'highest')[source]¶
Pydantic model for training specification hyperparameter sampling.
- device¶
The device to train on (e.g., ‘cuda’, ‘cpu’).
- Type:
str
- epochs¶
A list of possible numbers of epochs to train for.
- Type:
list[int]
- log_interval¶
The interval in batches for logging.
- Type:
int
- class_share_log_columns¶
Columns for which to log class share.
- Type:
list[str]
- early_stopping_epochs¶
Number of epochs for early stopping.
- Type:
int | None
- save_interval_epochs¶
Interval in epochs for saving model checkpoints.
- Type:
int
- save_latest_interval_minutes¶
The time interval, in minutes, at which a checkpoint is written to the “latest” checkpoint path.
- Type:
float | None
- save_batch_interval_minutes¶
The time interval, in minutes, at which a checkpoint is written to a unique checkpoint path.
- Type:
float | None
- save_batch_interval_minutes_val_loss¶
If True, calculates the validation loss whenever a batch-interval checkpoint is saved.
- Type:
bool
- calculate_validation_loss_on_initialization¶
If True, calculates the validation loss on the initial weights.
- Type:
bool
- batch_size¶
A list of possible batch sizes.
- Type:
list[int] | sequifier.config.hyperparameter_search_config.IntDistribution
- learning_rate¶
A list of possible learning rates.
- Type:
list[float]
- criterion¶
A dictionary mapping target columns to loss functions.
- Type:
dict[str, str]
- class_weights¶
Optional dictionary mapping columns to class weights.
- Type:
dict[str, list[float]] | None
- accumulation_steps¶
A list of possible gradient accumulation steps.
- Type:
list[int] | sequifier.config.hyperparameter_search_config.IntDistribution
- dropout¶
A list of possible dropout rates.
- Type:
list[float] | sequifier.config.hyperparameter_search_config.FloatDistribution
- loss_weights¶
Optional dictionary mapping columns to loss weights.
- Type:
dict[str, float] | None
- optimizer¶
A list of possible optimizer configurations.
- Type:
list[sequifier.config.train_config.DotDict]
- scheduler¶
A list of possible scheduler configurations.
- Type:
list[sequifier.config.train_config.DotDict]
- continue_training¶
Flag to continue training from a checkpoint.
- Type:
bool
- layer_type_dtypes¶
Dictionary mapping layer types (linear, embedding, norm) to dtypes (bfloat16, float8_e4m3fn).
- Type:
dict[str, str] | None
- layer_autocast¶
Whether to use autocast.
- Type:
bool | None
- sampling_strategy¶
The data sampling strategy in distributed training: ‘exact’, ‘oversampling’ or ‘undersampling’.
- Type:
str
- data_parallelism¶
The data parallelism mode, either ‘DDP’ or ‘FSDP’.
- Type:
str | None
- fsdp_cpu_offload¶
Whether to enable CPU offload when using FSDP.
- Type:
bool | None
- torch_compile¶
Whether to compile the entire model (‘outer’) or only the transformer layers (‘inner’) with torch.compile; ‘none’ disables compilation.
- Type:
str
- float32_matmul_precision¶
The precision level of float32 matrix multiplications. One of ‘highest’, ‘high’ and ‘medium’.
- Type:
str
- __init__(**kwargs)[source]¶
Initialize the TrainingSpecHyperparameterSampling instance.
This method initializes the Pydantic BaseModel and then processes the optimizer and scheduler configurations from the provided keyword arguments, converting them into DotDict objects.
- Parameters:
**kwargs – Keyword arguments that correspond to the attributes of this class. The ‘optimizer’ and ‘scheduler’ arguments are expected to be lists of dictionaries.
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- sample_trial(trial: Any) TrainingSpecModel[source]¶
Samples training hyperparameters using an Optuna trial.
This method leverages the provided Optuna trial to suggest values for hyperparameters like batch size, dropout, and learning rate based on the defined search spaces (categorical lists or distributions).
- Parameters:
trial (Any) – The Optuna trial object used for suggesting hyperparameters.
- Returns:
A populated training specification model with the sampled hyperparameters.
- Return type:
TrainingSpecModel
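The sampling classes above accept either a list of candidate values or a distribution for several fields. The sketch below shows one way such a search space could be resolved against an Optuna-style trial. `StubTrial` only mimics the `suggest_categorical` and `suggest_int` methods of an Optuna trial so that the example runs without Optuna installed, and `IntRange` is a hypothetical stand-in for an integer distribution; the fields of sequifier's IntDistribution are not documented here and are not assumed.

```python
import random
from dataclasses import dataclass


@dataclass
class IntRange:
    """Hypothetical integer range with inclusive bounds."""

    low: int
    high: int


class StubTrial:
    """Minimal stand-in that mimics two methods of an Optuna trial."""

    def __init__(self, seed):
        self.rng = random.Random(seed)

    def suggest_categorical(self, name, choices):
        return self.rng.choice(list(choices))

    def suggest_int(self, name, low, high):
        return self.rng.randint(low, high)


def sample_value(trial, name, space):
    """Sample from a range if one is given, otherwise pick from the list."""
    if isinstance(space, IntRange):
        return trial.suggest_int(name, space.low, space.high)
    return trial.suggest_categorical(name, space)


search_space = {
    "batch_size": IntRange(16, 64),
    "learning_rate": [1e-4, 3e-4, 1e-3],
}
trial = StubTrial(seed=0)
sampled = {
    name: sample_value(trial, name, space)
    for name, space in search_space.items()
}
print(sampled)
```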
Non-standard Optimizers¶
- class sequifier.optimizers.ademamix.AdEMAMix(params={}, lr=0.001, betas=(0.9, 0.999, 0.9999), eps=1e-08, weight_decay=0, alpha=5.0, T_alpha_beta3=None)[source]¶
Implements the AdEMAMix optimizer.
This optimizer is based on the paper “The AdEMAMix Optimizer: Better, Faster, Older”. It extends Adam with a second, slowly changing exponential moving average (EMA) of the gradients and mixes it into the update, weighted by alpha.
- Parameters:
params (iterable) – Iterable of parameters to optimize or dicts defining parameter groups.
lr (float, optional) – Learning rate (default: 1e-3).
betas (Tuple[float, float, float], optional) – Coefficients for the running averages: the fast gradient EMA, the squared-gradient EMA and the slow gradient EMA (default: (0.9, 0.999, 0.9999)).
eps (float, optional) – Term added to the denominator to improve numerical stability (default: 1e-8).
weight_decay (float, optional) – Weight decay (L2 penalty) (default: 0).
alpha (float, optional) – Mixing coefficient (default: 5.0).
T_alpha_beta3 (int, optional) – Time period for alpha and beta3 scheduling (default: None).
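For orientation, the update rule from the AdEMAMix paper can be written out for a single scalar parameter. This is a sketch under stated assumptions, not the code of this class: the function name `ademamix_step` and the `state` dictionary are hypothetical, the warm-up schedules controlled by T_alpha_beta3 are omitted, and weight decay is applied in the decoupled form used in the paper, which may differ from this implementation.

```python
import math


def ademamix_step(
    theta,
    grad,
    state,
    lr=1e-3,
    betas=(0.9, 0.999, 0.9999),
    eps=1e-8,
    weight_decay=0.0,
    alpha=5.0,
):
    """Apply one AdEMAMix-style update to a scalar parameter."""
    beta1, beta2, beta3 = betas
    state["t"] += 1
    t = state["t"]
    # Fast and slow EMAs of the gradient, plus the EMA of its square.
    state["m_fast"] = beta1 * state["m_fast"] + (1 - beta1) * grad
    state["m_slow"] = beta3 * state["m_slow"] + (1 - beta3) * grad
    state["nu"] = beta2 * state["nu"] + (1 - beta2) * grad * grad
    # Bias correction is applied to the fast EMA and the second moment only.
    m_fast_hat = state["m_fast"] / (1 - beta1**t)
    nu_hat = state["nu"] / (1 - beta2**t)
    # alpha weights the slow EMA when it is mixed into the update.
    update = (m_fast_hat + alpha * state["m_slow"]) / (math.sqrt(nu_hat) + eps)
    return theta - lr * (update + weight_decay * theta)


# Minimize f(x) = x**2, whose gradient is 2 * x, for a few steps.
x = 1.0
state = {"t": 0, "m_fast": 0.0, "m_slow": 0.0, "nu": 0.0}
for _ in range(20):
    x = ademamix_step(x, 2 * x, state, lr=0.01)
print(x)
```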
Internals¶
- sequifier.sequifier.build_args_config(args: Any) dict[str, Any][source]¶
Build configuration dictionary from command-line arguments.
- Parameters:
args – Parsed command-line arguments.
- Returns:
Dictionary containing configuration options.
- sequifier.sequifier.setup_parser() ArgumentParser[source]¶
Set up the argument parser for the command-line interface.
- Returns:
Configured ArgumentParser object.
- class sequifier.preprocess.Preprocessor(project_root: str, continue_preprocessing: bool, data_path: str, read_format: str, write_format: str, merge_output: bool, selected_columns: list[str] | None, split_ratios: list[float], seq_length: int, stride_by_split: list[int], max_rows: int | None, seed: int, n_cores: int | None, batches_per_file: int, process_by_file: bool, subsequence_start_mode: str, use_precomputed_maps: list[str] | None, metadata_config_path: str | None)[source]¶
A class for preprocessing data for the sequifier model.
This class handles loading, preprocessing, and saving data. It supports single-file and multi-file processing, and can handle large datasets by processing them in batches.
- project_root¶
The path to the sequifier project directory.
- Type:
str
- batches_per_file¶
The number of batches to process per file.
- Type:
int
- data_name_root¶
The root name of the data file.
- Type:
str
- merge_output¶
Whether to combine the output into a single file.
- Type:
bool
- target_dir¶
The target directory for temporary files.
- Type:
str
- seed¶
The random seed for reproducibility.
- Type:
int
- n_cores¶
The number of cores to use for parallel processing.
- Type:
int
- split_paths¶
The paths to the output split files.
- Type:
list[str]
- __init__(project_root: str, continue_preprocessing: bool, data_path: str, read_format: str, write_format: str, merge_output: bool, selected_columns: list[str] | None, split_ratios: list[float], seq_length: int, stride_by_split: list[int], max_rows: int | None, seed: int, n_cores: int | None, batches_per_file: int, process_by_file: bool, subsequence_start_mode: str, use_precomputed_maps: list[str] | None, metadata_config_path: str | None)[source]¶
Initializes the Preprocessor with the given parameters.
- Parameters:
project_root – The path to the sequifier project directory.
data_path – The path to the input data file.
read_format – The file type of the input data.
write_format – The file type for the preprocessed output data.
merge_output – Whether to combine the output into a single file.
selected_columns – A list of columns to be included in the preprocessing.
split_ratios – A list of floats that define the relative sizes of data splits.
seq_length – The sequence length for the model inputs.
stride_by_split – A list of step sizes for creating subsequences.
max_rows – The maximum number of input rows to process.
seed – A random seed for reproducibility.
n_cores – The number of CPU cores to use for parallel processing.
batches_per_file – The number of batches to process per file.
process_by_file – A flag to indicate if processing should be done file by file.
use_precomputed_maps – An optional list of columns for which to enforce precomputed maps.
metadata_config_path – An optional path to a precomputed metadata config.
- sequifier.preprocess.cast_columns_to_string(data: DataFrame) DataFrame[source]¶
Casts the column names of a Polars DataFrame to strings.
This is often necessary because Polars schemas may use integers as column names (e.g., ‘0’, ‘1’, ‘2’…) which need to be strings for some operations.
- Parameters:
data – The Polars DataFrame.
- Returns:
The same DataFrame with its columns attribute modified.
- sequifier.preprocess.combine_maps(map1: dict[Union[str, int], int], map2: dict[Union[str, int], int]) dict[Union[str, int], int][source]¶
Combines two ID maps into a new, consolidated map.
Takes all unique keys from both map1 and map2, sorts them, and creates a new, single map where keys are mapped to 1-based indices based on the sorted order. This ensures a consistent mapping across different data chunks.
- Parameters:
map1 – The first ID map.
map2 – The second ID map.
- Returns:
A new, combined, and re-indexed ID map.
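The behaviour described above (union of keys, sorted, re-indexed from 1) can be sketched in a few lines. This is an illustrative reimplementation based only on the description, named `combine_maps_sketch` to keep it distinct from the real function; it assumes all keys are of one sortable type, since Python cannot order a mix of str and int keys.

```python
def combine_maps_sketch(map1, map2):
    """Merge the keys of two ID maps and assign fresh 1-based indices."""
    # The old index values are discarded; only the keys are kept.
    combined_keys = sorted(set(map1) | set(map2))
    return {key: index for index, key in enumerate(combined_keys, start=1)}


chunk_a = {"apple": 1, "cherry": 2}
chunk_b = {"banana": 1, "cherry": 2}
print(combine_maps_sketch(chunk_a, chunk_b))
```

Because the indices are derived from the sorted key set, the result does not depend on the order in which the two chunks are passed.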
- sequifier.preprocess.combine_multiprocessing_outputs(project_root: str, target_dir: str, n_splits: int, input_files: dict[int, list[str]], dataset_name: str, write_format: str, in_target_dir: bool = False, pre_split_str: str | None = None, post_split_str: str | None = None) None[source]¶
Combines multiple intermediate batch files into final split files.
This function iterates through each split and combines all the intermediate files listed in input_files[split] into a single final output file for that split.
For “csv” format, it uses the csvstack command-line utility.
For “parquet” format, it uses pyarrow.parquet.ParquetWriter to concatenate the files efficiently.
- Parameters:
project_root – The path to the sequifier project directory.
target_dir – The temporary directory containing intermediate files.
n_splits – The number of data splits.
input_files – A dictionary mapping split index (int) to a list of input file paths (str) for that split.
dataset_name – The root name for the final output files.
write_format – The file format (“csv” or “parquet”).
in_target_dir – If True, the final combined file is written inside target_dir. If False, it’s written to data/.
pre_split_str – An optional string to insert into the filename before the “-split{i}” part.
post_split_str – An optional string to insert into the filename after the “-split{i}” part.
- sequifier.preprocess.combine_parquet_files(files: list[str], out_path: str) None[source]¶
Combines multiple Parquet files into a single Parquet file.
This function reads the schema from the first file and uses it to initialize a ParquetWriter. It then iterates through all files in the list, reading each one as a table and writing it to the new combined file. This is more memory-efficient than reading all files into one large table first.
- Parameters:
files – A list of paths to the Parquet files to combine.
out_path – The path for the combined output Parquet file.
- sequifier.preprocess.create_file_paths_for_multiple_files1(project_root: str, target_dir: str, n_splits: int, n_batches: int, process_id: int, file_index_str: str, dataset_name: str, write_format: str) dict[int, list[str]][source]¶
Creates a dictionary of temporary file paths for a specific data file.
This is used in the multi-file, merge_output=True workflow. It generates file path names for intermediate batches before they are combined.
The naming pattern is: {dataset_name}-{process_id}-{file_index_str}-split{split}-{batch_id}.{write_format}
- Parameters:
project_root – The path to the sequifier project directory.
target_dir – The temporary directory to place files in.
n_splits – The number of data splits.
n_batches – The number of batches created by the process.
process_id – The ID of the multiprocessing worker.
file_index_str – The index of the file being processed by this worker.
dataset_name – The root name of the dataset.
write_format – The file extension.
- Returns:
A dictionary mapping a split index (int) to a list of file paths (str) for all batches in that split.
- sequifier.preprocess.create_file_paths_for_multiple_files2(project_root: str, target_dir: str, n_splits: int, n_processes: int, n_files: dict[int, int], dataset_name: str, write_format: str) dict[int, list[str]][source]¶
Creates a dictionary of intermediate file paths for a multi-file run.
This is used in the multi-file, merge_output=True workflow. It generates the file paths for the combined files from each process, which are the inputs to the final combination step.
The naming pattern is: {dataset_name}-{process_id}-{file_index}-split{split}.{write_format}
- Parameters:
project_root – The path to the sequifier project directory.
target_dir – The temporary directory where files are located.
n_splits – The number of data splits.
n_processes – The total number of multiprocessing workers.
n_files – A dictionary mapping process_id to the number of files that process handled.
dataset_name – The root name of the dataset.
write_format – The file extension.
- Returns:
A dictionary mapping a split index (int) to a list of all intermediate combined file paths (str) for that split.
- sequifier.preprocess.create_file_paths_for_single_file(project_root: str, target_dir: str, n_splits: int, n_batches: int, dataset_name: str, write_format: str) dict[int, list[str]][source]¶
Creates a dictionary of temporary file paths for a single-file run.
This is used in the single-file, merge_output=True workflow. It generates file path names for intermediate batches created by different processes before they are combined.
The naming pattern is: {dataset_name}-split{split}-{core_id}.{write_format}
- Parameters:
project_root – The path to the sequifier project directory.
target_dir – The temporary directory to place files in.
n_splits – The number of data splits.
n_batches – The number of processes (batches) running in parallel.
dataset_name – The root name of the dataset.
write_format – The file extension.
- Returns:
A dictionary mapping a split index (int) to a list of file paths (str) for all batches in that split.
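The naming pattern above can be illustrated with a short sketch. It is not the library code: single_file_paths_sketch is a hypothetical name, the files are assumed to sit directly under target_dir, and core_id is assumed to count from 0.

```python
import posixpath


def single_file_paths_sketch(target_dir, n_splits, n_batches, dataset_name, write_format):
    """Build {split: [paths]} following {dataset_name}-split{split}-{core_id}.{write_format}."""
    # Assumptions: paths live directly in target_dir and core_id starts at 0.
    return {
        split: [
            posixpath.join(target_dir, f"{dataset_name}-split{split}-{core_id}.{write_format}")
            for core_id in range(n_batches)
        ]
        for split in range(n_splits)
    }


paths = single_file_paths_sketch("temp", n_splits=2, n_batches=2,
                                 dataset_name="data", write_format="parquet")
print(paths[0])  # ['temp/data-split0-0.parquet', 'temp/data-split0-1.parquet']
```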
- sequifier.preprocess.create_id_map(data: DataFrame, column: str) dict[Union[str, int], int][source]¶
Creates a map from unique values in a column to integer indices.
Finds all unique values in the specified column of the data DataFrame, sorts them, and creates a dictionary mapping each unique value to a 1-based integer index.
- Parameters:
data – The Polars DataFrame containing the column.
column – The name of the column to map.
- Returns:
A dictionary mapping unique values (str or int) to an integer index (starting from 1).
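As an illustration, the mapping can be sketched over a plain list of values instead of a Polars column. The name create_id_map_sketch is hypothetical, and the values are assumed to be mutually comparable.

```python
def create_id_map_sketch(values):
    """Map each unique value to a 1-based index following sorted order."""
    unique_sorted = sorted(set(values))
    return {value: index for index, value in enumerate(unique_sorted, start=1)}


column = ["b", "a", "c", "a"]
id_map = create_id_map_sketch(column)
encoded = [id_map[value] for value in column]
print(id_map)   # {'a': 1, 'b': 2, 'c': 3}
print(encoded)  # [2, 1, 3, 1]
```

Starting at 1 leaves the index 0 unused by the map; extract_subsequences, documented below, pads short sequences with 0s.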
- sequifier.preprocess.delete_files(files: list[str] | dict[int, list[str]]) None[source]¶
Deletes a list of files from the filesystem.
- Parameters:
files – A list of file paths to delete, or a dictionary whose values are lists of file paths to delete.
- sequifier.preprocess.extract_sequences(data: DataFrame, schema: Any, seq_length: int, stride_for_split: int, columns: list[str], subsequence_start_mode: str) DataFrame[source]¶
Extracts subsequences from a DataFrame of full sequences.
This function takes a DataFrame where each row contains all items for a single sequenceId. It iterates through each sequenceId, extracts all possible subsequences of seq_length using the specified stride_for_split, calculates the starting position of each subsequence within the original sequence, and formats them into a new, long-format DataFrame that conforms to the provided schema.
- Parameters:
data – The input Polars DataFrame, grouped by “sequenceId”.
schema – The schema for the output long-format DataFrame.
seq_length – The length of the subsequences to extract.
stride_for_split – The step size to use when sliding the window to create subsequences.
columns – A list of the data column names (features) to extract.
subsequence_start_mode – “distribute” to minimize max subsequence overlap, or “exact”.
- Returns:
A new, long-format Polars DataFrame containing the extracted subsequences, matching the provided schema. Includes columns for sequenceId, subsequenceId, startItemPosition, inputCol, and the sequence items (‘0’, ‘1’, …).
- sequifier.preprocess.extract_subsequences(in_seq: dict[str, list], seq_length: int, stride_for_split: int, columns: list[str], subsequence_start_mode: str) dict[str, list[list[Union[float, int]]]][source]¶
Extracts subsequences from a dictionary of sequence lists.
This function takes a dictionary in_seq where keys are column names and values are lists of items for a single full sequence. It first pads the sequences with 0s at the beginning if they are shorter than seq_length. Then, it calculates the subsequence start indices using get_subsequence_starts and extracts all subsequences.
- Parameters:
in_seq – A dictionary mapping column names to lists of items (e.g., {‘col_A’: [1, 2, 3, 4, 5], ‘col_B’: [6, 7, 8, 9, 10]}).
seq_length – The length of the subsequences to extract.
stride_for_split – The desired step size between subsequences.
columns – A list of the column names (keys in in_seq) to process.
subsequence_start_mode – “distribute” to minimize max subsequence overlap, or “exact”.
- Returns:
A dictionary mapping column names to a list of lists, where each inner list is a subsequence.
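A simplified sketch of the padding and slicing steps, assuming the start indices have already been computed (in the library they come from get_subsequence_starts). The name extract_subsequences_sketch and the explicit starts argument are hypothetical.

```python
def extract_subsequences_sketch(in_seq, seq_length, starts, columns):
    """Left-pad each column with 0s if needed, then slice one window per start index."""
    out = {}
    for column in columns:
        items = list(in_seq[column])
        # Pad at the beginning when the sequence is shorter than seq_length.
        padded = [0] * max(0, seq_length - len(items)) + items
        out[column] = [padded[start:start + seq_length] for start in starts]
    return out


full = {"col_A": [1, 2, 3, 4, 5], "col_B": [6, 7, 8, 9, 10]}
windows = extract_subsequences_sketch(full, 3, [0, 2], ["col_A", "col_B"])
print(windows["col_A"])  # [[1, 2, 3], [3, 4, 5]]

short = extract_subsequences_sketch({"col_A": [1, 2, 3]}, 5, [0], ["col_A"])
print(short["col_A"])  # [[0, 0, 1, 2, 3]]
```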
- sequifier.preprocess.get_batch_limits(data: DataFrame, n_batches: int) list[tuple[int, int]][source]¶
Calculates row indices to split a DataFrame into batches.
This function divides the DataFrame into n_batches roughly equal chunks. Crucially, it ensures that no sequenceId is split across two different batches. It does this by finding the ideal split points and then adjusting them to the nearest sequenceId boundary.
- Parameters:
data – The DataFrame to split. Must be sorted by “sequenceId”.
n_batches – The desired number of batches.
- Returns:
A list of (start_index, end_index) tuples, where each tuple defines the row indices for a batch.
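The boundary adjustment can be sketched on a sorted list of sequence IDs. This is a simplification under stated assumptions: batch_limits_sketch is a hypothetical name, each ideal cut is only moved forward to the next boundary (the description above says "nearest"), end indices are treated as exclusive, and empty batches are dropped.

```python
def batch_limits_sketch(sequence_ids, n_batches):
    """Return (start, end) row ranges such that no sequenceId spans two batches."""
    n_rows = len(sequence_ids)
    cuts = [0]
    for batch in range(1, n_batches):
        # Ideal cut for equal-sized batches, never before the previous cut.
        cut = max(round(batch * n_rows / n_batches), cuts[-1])
        # Move forward until the cut sits between two different sequence IDs.
        while 0 < cut < n_rows and sequence_ids[cut] == sequence_ids[cut - 1]:
            cut += 1
        cuts.append(cut)
    cuts.append(n_rows)
    # Assumption: end indices are exclusive and empty batches are skipped.
    return [(start, end) for start, end in zip(cuts, cuts[1:]) if end > start]


ids = [1, 1, 1, 2, 2, 3, 3, 3, 3, 4]
limits = batch_limits_sketch(ids, 3)
print(limits)  # [(0, 3), (3, 9), (9, 10)]
```

The second ideal cut falls at row 7, inside sequence 3, so it is pushed to row 9 and the batches end up uneven but intact.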
- sequifier.preprocess.get_combined_statistics(n1: int, mean1: float, std1: float, n2: int, mean2: float, std2: float) tuple[float, float][source]¶
Calculates the combined mean and standard deviation of two data subsets.
Uses a stable parallel algorithm (related to Welford’s algorithm) to combine statistics from two subsets without needing the original data.
- Parameters:
n1 – Number of samples in subset 1.
mean1 – Mean of subset 1.
std1 – Standard deviation of subset 1.
n2 – Number of samples in subset 2.
mean2 – Mean of subset 2.
std2 – Standard deviation of subset 2.
- Returns:
A tuple (combined_mean, combined_std) containing the combined mean and standard deviation of the two subsets.
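For reference, a sketch of the standard pairwise combination formula. It assumes population standard deviations (ddof=0); whether the library uses population or sample standard deviation is not stated here, and the name combine_mean_std is hypothetical.

```python
import math
import statistics


def combine_mean_std(n1, mean1, std1, n2, mean2, std2):
    """Combine the mean and population std of two subsets without the raw data."""
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    delta = mean2 - mean1
    # Sum of squared deviations: within-subset parts plus a between-subset term.
    m2 = n1 * std1**2 + n2 * std2**2 + delta**2 * n1 * n2 / n
    return mean, math.sqrt(m2 / n)


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0]
mean, std = combine_mean_std(
    len(a), statistics.fmean(a), statistics.pstdev(a),
    len(b), statistics.fmean(b), statistics.pstdev(b),
)
# Both values match the statistics computed directly on a + b.
print(math.isclose(mean, statistics.fmean(a + b)))
print(math.isclose(std, statistics.pstdev(a + b)))
```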
- sequifier.preprocess.get_group_bounds(data_subset: DataFrame, split_ratios: list[float])[source]¶
Calculates row indices for splitting a sequence into groups.
This function takes a DataFrame data_subset (which typically contains all items for a single sequenceId) and calculates the row indices to split it into multiple groups (e.g., train, val, test) based on the provided split_ratios.
- Parameters:
data_subset – The DataFrame (for a single sequence) to split.
split_ratios – A list of floats (e.g., [0.8, 0.1, 0.1]) that sum to 1.0, defining the relative sizes of the splits.
- Returns:
A list of (start_index, end_index) tuples, one for each proportion, defining the row slices for each group.
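A sketch of how cumulative ratios can be turned into row slices. It takes a row count rather than a DataFrame, and the rounding rule is an assumption; group_bounds_sketch is a hypothetical name.

```python
def group_bounds_sketch(n_rows, split_ratios):
    """Turn split ratios into contiguous (start, end) row slices covering n_rows."""
    bounds = []
    start = 0
    cumulative = 0.0
    for index, ratio in enumerate(split_ratios):
        cumulative += ratio
        is_last = index == len(split_ratios) - 1
        # Assumption: round the cumulative share; the last group absorbs rounding error.
        end = n_rows if is_last else int(round(cumulative * n_rows))
        bounds.append((start, end))
        start = end
    return bounds


bounds = group_bounds_sketch(10, [0.8, 0.1, 0.1])
print(bounds)  # [(0, 8), (8, 9), (9, 10)]
```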
- sequifier.preprocess.get_subsequence_starts(in_seq_length: int, seq_length: int, stride_for_split: int, subsequence_start_mode: str) ndarray[source]¶
Calculates the start indices for extracting subsequences.
This function determines the starting indices for sliding a window of seq_length over an input sequence of in_seq_length. It aims to use stride_for_split, but adjusts the step size slightly to ensure that the windows are distributed as evenly as possible and cover the full sequence from the beginning to the end.
- Parameters:
in_seq_length – The length of the original input sequence.
seq_length – The length of the subsequences to extract.
stride_for_split – The desired step size between subsequences.
subsequence_start_mode – “distribute” to minimize max subsequence overlap, or “exact”.
- Returns:
A numpy array of integer start indices for each subsequence.
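One plausible reading of the two modes, sketched in plain Python. This is an assumption rather than the library's implementation: "exact" is taken to mean stepping by the stride as given, "distribute" keeps the window count implied by the stride but spaces the starts evenly so the last window ends at the end of the sequence, and the input is assumed to be at least seq_length long. The name subsequence_starts_sketch is hypothetical, and it returns a list rather than a NumPy array.

```python
import math


def subsequence_starts_sketch(in_seq_length, seq_length, stride, mode="distribute"):
    """Return window start indices for a sequence of in_seq_length items."""
    last_start = in_seq_length - seq_length  # assumes in_seq_length >= seq_length
    if mode == "exact":
        # Assumption: step by the stride as given; the tail may stay uncovered.
        return list(range(0, last_start + 1, stride))
    # Assumption for "distribute": same window count, evenly spaced starts.
    n_windows = math.ceil(last_start / stride) + 1
    if n_windows == 1:
        return [0]
    return [round(i * last_start / (n_windows - 1)) for i in range(n_windows)]


exact = subsequence_starts_sketch(10, 4, 4, mode="exact")
spread = subsequence_starts_sketch(10, 4, 4, mode="distribute")
print(exact)   # [0, 4]
print(spread)  # [0, 3, 6]
```

With a stride of 4 over 10 items, the exact windows cover items 0 to 7 and leave the last two out, while the distributed starts use an effective step of 3 and reach the final item.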
- sequifier.preprocess.insert_top_folder(path: str, folder_name: str) str[source]¶
Inserts a directory name into a file path, just before the filename.
Example
insert_top_folder(“a/b/c.txt”, “temp”) returns “a/b/temp/c.txt”
- Parameters:
path – The original file path.
folder_name – The name of the folder to insert.
- Returns:
The new path string with the folder inserted.
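The documented behaviour can be reproduced with the standard library. This is a sketch with a hypothetical name, using posixpath so that the separator is always "/".

```python
import posixpath


def insert_top_folder_sketch(path, folder_name):
    """Insert folder_name between the directory part and the filename of path."""
    directory, filename = posixpath.split(path)
    return posixpath.join(directory, folder_name, filename)


result = insert_top_folder_sketch("a/b/c.txt", "temp")
print(result)  # a/b/temp/c.txt
```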
- sequifier.preprocess.load_precomputed_id_maps(project_root: str, data_columns: list[str] | None, required_maps: list[str] | None = None) dict[str, dict[Union[str, int], int]][source]¶
Loads custom ID maps from configs/id_maps if the folder exists.
- Parameters:
project_root – The path to the project root directory.
data_columns – Optional list of columns present in the data to validate against the found map files.
required_maps – Optional list of columns for which a precomputed id_map is required.
- Returns:
A dictionary mapping column names to their ID maps.
- sequifier.preprocess.preprocess(args: Any, args_config: dict[str, Any]) None[source]¶
Runs the main data preprocessing pipeline.
This function loads the preprocessing configuration, initializes the Preprocessor class, and executes the preprocessing steps based on the loaded configuration.
- Parameters:
args – An object containing command-line arguments. Expected to have a config_path attribute specifying the path to the YAML configuration file.
args_config – A dictionary containing additional configuration parameters that may override or supplement the settings loaded from the config file.
- sequifier.preprocess.preprocess_batch(project_root: str, data_name_root: str, process_id: int, batch: DataFrame, schema: Any, split_paths: list[str], seq_length: int, stride_by_split: list[int], data_columns: list[str], col_types: dict[str, str], split_ratios: list[float], target_dir: str, write_format: str, batches_per_file: int, subsequence_start_mode: str, merge_output: bool) None[source]¶
Processes a batch of data.
- Parameters:
project_root – The path to the sequifier project directory.
data_name_root – The root name of the data file.
process_id – The id of the process.
batch – The batch of data to process.
schema – The schema for the preprocessed data.
split_paths – The paths to the output split files.
seq_length – The sequence length for the model inputs.
stride_by_split – A list of step sizes for creating subsequences.
data_columns – A list of data columns.
col_types – A dictionary containing the column types.
split_ratios – A list of floats that define the relative sizes of data splits.
target_dir – The target directory for temporary files.
write_format – The file format for the output files.
batches_per_file – The number of batches to process per file.
subsequence_start_mode – “distribute” to minimize max subsequence overlap, or “exact”.
merge_output – Whether to combine the output into a single file.
- sequifier.preprocess.process_and_write_data_pt(data: DataFrame, seq_length: int, path: str, column_types: dict[str, str])[source]¶
Processes the sequence DataFrame and writes it to a .pt file.
This function takes the long-format sequence DataFrame (data), aggregates it by sequenceId and subsequenceId, and pivots it so that each inputCol becomes its own column containing a list of sequence items. It also extracts the startItemPosition.
It then converts these lists into NumPy arrays, splits them into sequences (all but last item) and targets (all but first item), and converts them to PyTorch tensors along with sequence/subsequence IDs and start positions. The final data tuple (sequences_dict, targets_dict, sequence_ids_tensor, subsequence_ids_tensor, start_item_positions_tensor) is saved to a .pt file using torch.save.
- Parameters:
data – The long-format Polars DataFrame of extracted sequences.
seq_length – The total sequence length (N). The resulting tensors will have sequence length N-1.
path – The output file path (e.g., “data/batch_0.pt”).
column_types – A dictionary mapping column names to their string data types, used to determine the correct torch dtype.
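The split into sequences and targets is a one-item shift of the same window. A sketch with plain lists (the function works on NumPy arrays and PyTorch tensors); split_inputs_targets and the column name are hypothetical.

```python
def split_inputs_targets(window):
    """Split a window of N items into inputs (all but last) and targets (all but first)."""
    return window[:-1], window[1:]


# One extracted subsequence per column, with seq_length N = 5.
windows = {"itemId": [11, 12, 13, 14, 15]}
sequences = {}
targets = {}
for column, window in windows.items():
    sequences[column], targets[column] = split_inputs_targets(window)

print(sequences["itemId"])  # [11, 12, 13, 14]
print(targets["itemId"])    # [12, 13, 14, 15]
```

Each target position holds the item that follows the input at the same position, so both have length N-1.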
- class sequifier.train.TransformerEmbeddingModel(transformer_model: TransformerModel)[source]¶
A wrapper around the TransformerModel to expose the embedding functionality.
- __init__(transformer_model: TransformerModel)[source]¶
Initializes the TransformerEmbeddingModel.
- Parameters:
transformer_model – The TransformerModel to wrap.
- class sequifier.train.TransformerModel(hparams: Any, rank: int | None = None, local_rank: int | None = None)[source]¶
The main Transformer model for the sequifier.
This class implements the Transformer model, including the training and evaluation loops, as well as the export functionality.
- __init__(hparams: Any, rank: int | None = None, local_rank: int | None = None)[source]¶
Initializes the TransformerModel.
Based on the hyperparameters, this initializes:
- Embeddings for categorical and real features (self.encoder)
- Positional encoders (self.pos_encoder)
- The main TransformerEncoder (self.transformer_encoder)
- Output decoders for each target column (self.decoder)
- Loss functions (self.criterion)
- Optimizer (self.optimizer) and scheduler (self.scheduler)
- Parameters:
hparams – The hyperparameters for the model (e.g., from TrainModel config).
rank – The rank of the current process (for distributed training).
local_rank – The local rank of the current process (for distributed training).
- apply_softmax(target_column: str, output: Tensor) Tensor[source]¶
Applies softmax to the output of the decoder.
If the target is real, it returns the output unchanged. If the target is categorical, it applies LogSoftmax.
- Parameters:
target_column – The name of the target column.
output – The decoded output tensor (logits or real value).
- Returns:
The output tensor, with LogSoftmax applied if categorical.
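A plain-Python sketch of the branching described above, applied to a single vector of logits. The method itself operates on tensors; apply_softmax_sketch and the explicit column_type argument are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return [value - log_norm for value in logits]


def apply_softmax_sketch(column_type, output):
    """Return real outputs unchanged; apply log-softmax to categorical outputs."""
    if column_type == "real":
        return list(output)
    return log_softmax(output)


log_probs = apply_softmax_sketch("categorical", [2.0, 1.0, 0.1])
print(sum(math.exp(value) for value in log_probs))  # approximately 1.0
print(apply_softmax_sketch("real", [0.42]))          # [0.42]
```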
- decode(target_column: str, output: Tensor) Tensor[source]¶
Decodes the output of the transformer encoder.
Applies the appropriate final linear layer for a given target column.
- Parameters:
target_column – The name of the target column to decode.
output – The raw output tensor from the TransformerEncoder (seq_length, batch_size, dim_model).
- Returns:
The decoded output (logits or real value) for the target column (seq_length, batch_size, n_classes/1).
- forward(src: dict[str, torch.Tensor], return_logits: bool | Tensor = False) dict[str, torch.Tensor][source]¶
The main forward pass of the model.
This is typically used for inference/evaluation, returning the probabilities/values for the last token in the sequence.
- Parameters:
src – A dictionary mapping column names to input tensors (batch_size, seq_length).
return_logits – Whether to return logits rather than LogSoftmax probabilities for categorical targets.
- Returns:
A dictionary mapping target column names to their final output (LogSoftmax probabilities or real values) for the last token (batch_size, n_classes/1).
- forward_embed(src: dict[str, torch.Tensor]) Tensor[source]¶
Forward pass for the embedding model.
This returns only the embedding from the last token in the sequence.
- Parameters:
src – A dictionary mapping column names to input tensors (batch_size, seq_length).
- Returns:
The embedding tensor for the last token (batch_size, dim_model).
- forward_inner(src: dict[str, torch.Tensor]) Tensor[source]¶
The inner forward pass of the model.
This handles embedding lookup, positional encoding, and passing the combined tensor through the transformer encoder.
- Parameters:
src – A dictionary mapping column names to input tensors (batch_size, seq_length).
- Returns:
The raw output tensor from the TransformerEncoder (seq_length, batch_size, dim_model).
- forward_train(src: dict[str, torch.Tensor]) dict[str, torch.Tensor][source]¶
Forward pass for training.
This runs the inner forward pass and then applies the appropriate decoder for each target column.
- Parameters:
src – A dictionary mapping column names to input tensors (batch_size, seq_length).
- Returns:
A dictionary mapping target column names to their raw output (logit) tensors (seq_length, batch_size, n_classes/1).
- train_model(train_loader: DataLoader, valid_loader: DataLoader, ddp_model: Module | None = None) None[source]¶
Trains the model.
This method contains the main training loop, including epoch iteration, validation, early stopping logic, and model saving/exporting.
- Parameters:
train_loader – DataLoader for the training dataset.
valid_loader – DataLoader for the validation dataset.
ddp_model – The DistributedDataParallel-wrapped model, if training is distributed.
- sequifier.train.format_number(number: int | float | float32) str[source]¶
Format a number for display.
- Parameters:
number – The number to format.
- Returns:
A formatted string representation of the number.
- sequifier.train.infer_with_embedding_model(model: Module, x: list[dict[str, numpy.ndarray]], device: str, size: int, target_columns: list[str]) ndarray[source]¶
Performs inference with an embedding model.
- Parameters:
model – The loaded TransformerEmbeddingModel.
x – A list of input data dictionaries (batched).
device – The device to run inference on.
size – The total number of samples (unused in this function).
target_columns – List of target column names (unused in this function).
- Returns:
A NumPy array containing the concatenated embeddings from all batches.
- sequifier.train.infer_with_generative_model(model: Module, x: list[dict[str, numpy.ndarray]], device: str, size: int, target_columns: list[str]) dict[str, numpy.ndarray][source]¶
Performs inference with a generative model.
- Parameters:
model – The loaded TransformerModel.
x – A list of input data dictionaries (batched).
device – The device to run inference on.
size – The total number of samples to trim the final output to.
target_columns – List of target column names to extract from the output.
- Returns:
A dictionary mapping target column names to their concatenated output NumPy arrays, trimmed to size.
- sequifier.train.load_inference_model(model_type: str, model_path: str, training_config_path: str, args_config: dict[str, Any], device: str, infer_with_dropout: bool) Module[source]¶
Loads a trained model for inference.
- Parameters:
model_type – “generative” or “embedding”.
model_path – Path to the saved .pt model file.
training_config_path – Path to the .yaml config file used for training.
args_config – A dictionary of override configurations.
device – The device to load the model onto (e.g., “cuda”, “cpu”).
infer_with_dropout – Whether to force dropout layers to be active during inference.
- Returns:
The loaded and compiled torch.nn.Module (TransformerModel or TransformerEmbeddingModel) in evaluation mode.
- sequifier.train.train(args: Any, args_config: dict[str, Any]) None[source]¶
The main training function.
- Parameters:
args – The command-line arguments.
args_config – The configuration dictionary.
- sequifier.train.train_worker(local_rank: int, world_size: int, config: TrainModel, from_folder: bool, global_rank: int, torch_compile: str)[source]¶
The worker function for distributed training.
- Parameters:
local_rank – The local rank of the current process.
world_size – The total number of processes.
config – The training configuration.
from_folder – Whether to load data from a folder (e.g., preprocessed .pt files) or a single file (e.g., .parquet).
global_rank – The global rank of the current process.
- class sequifier.infer.Inferer(model_type: str, model_path: str, project_root: str, id_maps: dict[str, dict[str | int, int]] | None, selected_columns_statistics: dict[str, dict[str, float]], map_to_id: bool, categorical_columns: list[str], real_columns: list[str], input_columns: list[str] | None, target_columns: list[str], target_column_types: dict[str, str], sample_from_distribution_columns: list[str] | None, infer_with_dropout: bool, prediction_length: int, inference_batch_size: int, device: str, args_config: dict[str, Any], training_config_path: str)[source]¶
A class for performing inference with a trained sequifier model.
This class encapsulates the model (either ONNX session or PyTorch model), normalization statistics, ID mappings, and all configuration needed to run inference. It provides methods to handle batching, model-specific inference calls (PyTorch vs. ONNX), and post-processing (like inverting normalization).
- model_type¶
‘generative’ or ‘embedding’.
- map_to_id¶
Whether to map integer predictions back to original IDs.
- selected_columns_statistics¶
Dict of ‘mean’ and ‘std’ for real columns.
- index_map¶
The inverse of id_maps, for mapping indices back to values.
- device¶
The device (‘cuda’ or ‘cpu’) for inference.
- target_columns¶
List of columns the model predicts.
- target_column_types¶
Dict mapping target columns to ‘categorical’ or ‘real’.
- inference_model_type¶
‘onnx’ or ‘pt’.
- ort_session¶
onnxruntime.InferenceSession if using ONNX.
- inference_model¶
The loaded PyTorch model if using ‘pt’.
- __init__(model_type: str, model_path: str, project_root: str, id_maps: dict[str, dict[str | int, int]] | None, selected_columns_statistics: dict[str, dict[str, float]], map_to_id: bool, categorical_columns: list[str], real_columns: list[str], input_columns: list[str] | None, target_columns: list[str], target_column_types: dict[str, str], sample_from_distribution_columns: list[str] | None, infer_with_dropout: bool, prediction_length: int, inference_batch_size: int, device: str, args_config: dict[str, Any], training_config_path: str)[source]¶
Initializes the Inferer.
- Parameters:
model_type – The type of model to use for inference.
model_path – The path to the trained model.
project_root – The path to the sequifier project directory.
id_maps – A dictionary of id maps for categorical columns.
selected_columns_statistics – A dictionary of statistics for numerical columns.
map_to_id – Whether to map the output to the original ids.
categorical_columns – A list of categorical columns.
real_columns – A list of real columns.
input_columns – A list of input columns.
target_columns – A list of target columns.
target_column_types – A dictionary of target column types.
sample_from_distribution_columns – A list of columns to sample from the distribution.
infer_with_dropout – Whether to use dropout during inference.
inference_batch_size – The batch size for inference.
device – The device to use for inference.
args_config – A dictionary of override configurations.
training_config_path – The path to the training configuration file.
- adjust_and_infer_embedding(x: dict[str, numpy.ndarray], size: int)[source]¶
Handles batching and backend-specific calls for embedding inference.
This function prepares the input data x into batches using prepare_inference_batches and then calls the correct inference backend based on self.inference_model_type (.pt or .onnx).
- Parameters:
x – The complete dictionary of input features (NumPy arrays).
size – The total number of samples in x, used to truncate any padding added for batching.
- Returns:
A NumPy array of embeddings, concatenated from all batches.
- adjust_and_infer_generative(x: dict[str, numpy.ndarray], size: int)[source]¶
Handles batching and backend-specific calls for generative inference.
This function prepares the input data x into batches using prepare_inference_batches and then calls the correct inference backend based on self.inference_model_type (.pt or .onnx). It aggregates the results from all batches.
- Parameters:
x – The complete dictionary of input features (NumPy arrays).
size – The total number of samples in x, used to truncate any padding added for batching.
- Returns:
A dictionary mapping target column names to NumPy arrays of raw model outputs (logits or real values).
- expand_to_batch_size(x: ndarray) ndarray[source]¶
Pads a NumPy array to match self.inference_batch_size.
Repeats samples from x until the array’s first dimension is equal to self.inference_batch_size.
- Parameters:
x – The input NumPy array to pad.
- Returns:
A new NumPy array of size self.inference_batch_size in the first dimension.
- infer_embedding(x: dict[str, numpy.ndarray]) ndarray[source]¶
Performs inference with an embedding model.
This is a high-level wrapper that calls adjust_and_infer_embedding to handle batching and model-specific logic.
- Parameters:
x – A dictionary mapping feature names to NumPy arrays. All arrays must have the same first dimension (batch size).
- Returns:
A 2D NumPy array of the resulting embeddings.
- infer_generative(x: dict[str, numpy.ndarray] | None, probs: dict[str, numpy.ndarray] | None = None, return_probs: bool = False) dict[str, numpy.ndarray][source]¶
Performs generative inference, returning probabilities or predictions.
This function orchestrates the generative inference process.
1. If probs are not provided, it calls adjust_and_infer_generative to get the raw model output (logits or real values) using x.
2. If return_probs is True, it normalizes the logits for categorical columns to get probabilities (using softmax, implemented in normalize) and returns a dictionary of probabilities (for categorical) and raw predicted values (for real).
3. If return_probs is False (default), it converts the model outputs (either from x or probs) into final predictions. For categorical columns, it either takes the argmax or samples from the distribution (sample_with_cumsum). For real columns, it returns the value as-is.
- Parameters:
x – A dictionary mapping feature names to NumPy arrays. Required if probs is not provided.
probs – An optional dictionary of probabilities/logits. If provided, this skips the model inference step.
return_probs – If True, returns normalized probabilities for categorical targets. If False, returns final class predictions (via argmax or sampling).
- Returns:
A dictionary mapping target column names to NumPy arrays. The content of the arrays depends on return_probs.
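The categorical post-processing can be sketched for a single row of model output. The names normalize and sample_with_cumsum appear in the description above, but the bodies below are illustrative assumptions written with plain lists, not the library code.

```python
import math
import random


def normalize(logits):
    """Softmax: turn logits (or log-probabilities) into probabilities."""
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [weight / total for weight in weights]


def sample_with_cumsum_sketch(probs, rng):
    """Draw a class index by comparing a uniform draw with the running sum."""
    threshold = rng.random()
    cumulative = 0.0
    for index, prob in enumerate(probs):
        cumulative += prob
        if threshold < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point rounding


probs = normalize([0.0, math.log(3.0)])  # approximately [0.25, 0.75]
greedy = max(range(len(probs)), key=probs.__getitem__)
sampled = sample_with_cumsum_sketch(probs, random.Random(0))
print(greedy)  # 1
```

Taking the argmax always returns the most likely class, whereas sampling returns class 0 in roughly a quarter of the draws for these probabilities.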
- infer_pure(x: dict[str, numpy.ndarray]) list[numpy.ndarray][source]¶
Performs a single inference pass using the ONNX session.
This function assumes x is already a single, correctly-sized batch. It formats the input dictionary to match the ONNX model’s input names and executes self.ort_session.run().
- Parameters:
x – A dictionary of feature arrays for a single batch. This batch must be of size self.inference_batch_size.
- Returns:
A list of NumPy arrays, representing the raw outputs from the ONNX model.
- invert_normalization(values: ndarray, target_column: str) ndarray[source]¶
Inverts Z-score normalization for a given target column.
Uses the ‘mean’ and ‘std’ stored in self.selected_columns_statistics to transform normalized values back to their original scale.
- Parameters:
values – A NumPy array of normalized values.
target_column – The name of the column whose statistics should be used for the inverse transformation.
- Returns:
A NumPy array of values in their original scale.
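Inverting a Z-score is a multiply-and-shift. A minimal sketch with plain lists, assuming the forward transform was (value - mean) / std with no additional epsilon term; the function name and the statistics dictionary below are hypothetical.

```python
def invert_normalization_sketch(values, stats):
    """Map Z-scores back to the original scale using stored mean and std."""
    # Assumption: the forward transform was (value - mean) / std.
    return [value * stats["std"] + stats["mean"] for value in values]


stats = {"mean": 20.0, "std": 10.0}
normalized = [-1.0, 0.0, 1.0]
restored = invert_normalization_sketch(normalized, stats)
print(restored)  # [10.0, 20.0, 30.0]
```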
- prepare_inference_batches(x: dict[str, numpy.ndarray], pad_to_batch_size: bool) list[dict[str, numpy.ndarray]][source]¶
Splits input data into batches for inference.
This function takes a large dictionary of feature arrays and splits them into a list of smaller dictionaries (batches) of size self.inference_batch_size.
- Parameters:
x – A dictionary of feature arrays.
pad_to_batch_size – If True (for ONNX), the last batch will be padded up to self.inference_batch_size by repeating samples. If False (for PyTorch), the last batch may be smaller.
- Returns:
A list of dictionaries, where each dictionary is a single batch ready for inference.
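A sketch of the batching and padding behaviour with plain lists in place of NumPy arrays. The name prepare_batches_sketch is hypothetical, batch_size is passed explicitly instead of being read from self.inference_batch_size, and padding is assumed to repeat the batch's own samples cyclically.

```python
def prepare_batches_sketch(x, batch_size, pad_to_batch_size):
    """Split a dict of equally long feature lists into batches of batch_size."""
    n_samples = len(next(iter(x.values())))
    batches = []
    for start in range(0, n_samples, batch_size):
        batch = {name: list(values[start:start + batch_size]) for name, values in x.items()}
        if pad_to_batch_size:
            # Assumption: pad by repeating the batch's own samples in order.
            batch = {
                name: (rows * -(-batch_size // len(rows)))[:batch_size]
                for name, rows in batch.items()
            }
        batches.append(batch)
    return batches


x = {"itemId": [[1, 2], [3, 4], [5, 6]]}
padded = prepare_batches_sketch(x, batch_size=2, pad_to_batch_size=True)
unpadded = prepare_batches_sketch(x, batch_size=2, pad_to_batch_size=False)
print(padded[-1])    # {'itemId': [[5, 6], [5, 6]]}
print(unpadded[-1])  # {'itemId': [[5, 6]]}

# After inference, padded outputs are cut back to the true sample count.
size = 3
flattened = [row for batch in padded for row in batch["itemId"]][:size]
print(flattened == x["itemId"])  # True
```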
- sequifier.infer.fill_number(number: int | float, max_length: int) str[source]¶
Pads a number with leading zeros to a specified string length.
Used for creating sortable string keys (e.g., “001-001”, “001-002”).
- Parameters:
number – The integer or float to format.
max_length – The total desired length of the output string.
- Returns:
A string representation of the number, padded with leading zeros.
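A sketch of the padding and of why it makes keys sortable. It uses str.zfill and covers integers only; how floats are formatted is not specified above, and fill_number_sketch is a hypothetical name.

```python
def fill_number_sketch(number, max_length):
    """Left-pad the decimal representation of an integer with zeros."""
    return str(number).zfill(max_length)


key = f"{fill_number_sketch(1, 3)}-{fill_number_sketch(2, 3)}"
print(key)  # 001-002

# Unpadded strings sort lexicographically, padded ones sort numerically.
print(sorted(["10", "9", "2"]))                                    # ['10', '2', '9']
print(sorted(fill_number_sketch(n, 3) for n in (10, 9, 2)))        # ['002', '009', '010']
```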
- sequifier.infer.get_embeddings(config: Any, inferer: Inferer, data: DataFrame, column_types: dict[str, torch.dtype]) ndarray[source]¶
Generates embeddings from a Polars DataFrame.
This function converts a Polars DataFrame into the NumPy array dictionary format expected by the Inferer. It uses numpy_to_pytorch for the main conversion, then transforms the tensors to NumPy arrays before passing them to inferer.infer_embedding.
- Parameters:
config – The InfererModel configuration object.
inferer – The initialized Inferer instance.
data – The input Polars DataFrame chunk.
column_types – A dictionary mapping column names to torch.dtype.
- Returns:
A NumPy array containing the computed embeddings for the batch.
- sequifier.infer.get_embeddings_pt(config: Any, inferer: Inferer, data: dict[str, torch.Tensor]) ndarray[source]¶
Generates embeddings from a batch of PyTorch tensor data.
This function serves as a wrapper for Inferer.infer_embedding when the input data is already in PyTorch tensor format (from loading .pt files which contain sequences, targets, sequence_ids, subsequence_ids, and start_positions). It converts the tensor dictionary to a NumPy array dictionary before passing it to the inferer.
- Parameters:
config – The InfererModel configuration object (unused, but kept for consistent function signature).
inferer – The initialized Inferer instance.
data – A dictionary mapping column/feature names to torch.Tensor objects (the sequences part loaded from the .pt file).
- Returns:
A NumPy array containing the computed embeddings for the batch.
- sequifier.infer.get_probs_preds_autoregression(config: Any, inferer: Inferer, data: DataFrame, column_types: dict[str, torch.dtype], seq_length: int) tuple[Optional[dict[str, numpy.ndarray]], dict[str, numpy.ndarray], numpy.ndarray, numpy.ndarray][source]¶
Generates autoregressive predictions and aligns them with sequence IDs and positions.
Extracts the initial sequence context from the sorted input DataFrame, maps it to PyTorch tensors, and executes step-by-step autoregressive inference.
- Parameters:
config – Inference configuration object.
inferer – Initialized Inferer instance.
data – Input DataFrame, sorted globally by sequenceId and locally by subsequenceId.
column_types – Mapping of input column names to their torch.dtype.
seq_length – Length of the input sequence context.
- Returns:
A tuple containing:
probs: Dict of probability arrays per target column (None if disabled).
preds: Dict of final prediction arrays per target column.
sequence_ids_for_preds: 1D array of sequence IDs matching the output shape.
item_positions_for_preds: 1D array of absolute item positions for each step.
- sequifier.infer.get_probs_preds_from_df(config: Any, inferer: Inferer, data: DataFrame, column_types: dict[str, torch.dtype]) tuple[Optional[dict[str, numpy.ndarray]], dict[str, numpy.ndarray]][source]¶
Generates predictions from a Polars DataFrame (non-autoregressive).
This function converts a Polars DataFrame into the NumPy array dictionary format expected by the Inferer. It’s used for standard, non-autoregressive generative inference. It calls inferer.infer_generative once and returns the probabilities (if requested) and predictions.
- Parameters:
config – The InfererModel configuration object.
inferer – The initialized Inferer instance.
data – The input Polars DataFrame chunk.
column_types – A dictionary mapping column names to torch.dtype.
- Returns:
A tuple (probs, preds):
probs: A dictionary mapping target columns to NumPy arrays of probabilities, or None if config.output_probabilities is False.
preds: A dictionary mapping target columns to NumPy arrays of final predictions.
- sequifier.infer.get_probs_preds_from_dict(config: Any, inferer: Inferer, data: dict[str, torch.Tensor], total_steps: int = 1) tuple[Optional[dict[str, numpy.ndarray]], dict[str, numpy.ndarray]][source]¶
Generates predictions from PyTorch tensor data, supporting autoregression.
This function performs generative inference on a batch of PyTorch tensor data loaded from .pt files (which contain sequences, targets, sequence_ids, subsequence_ids, and start_positions). It implements an autoregressive loop:
1. Runs inference on the initial data X (sequences).
2. For each subsequent step, creates the next input X_next by shifting the previous input X and appending the prediction from the last step, then runs inference on X_next.
3. Collects and reshapes all predictions and probabilities from all steps into a single flat batch, ordered by original sample index, then by step.
- Parameters:
config – The InfererModel configuration object, used to check output_probabilities and input_columns.
inferer – The initialized Inferer instance.
data – A dictionary mapping column/feature names to `torch.Tensor`s (the sequences part loaded from the .pt file).
total_steps – The number of total autoregressive steps to perform. A value of 1 means simple, non-autoregressive inference.
- Returns:
A tuple (probs, preds):
probs: A dictionary mapping target columns to NumPy arrays of probabilities, ordered by sample index then step, or None if config.output_probabilities is False.
preds: A dictionary mapping target columns to NumPy arrays of final predictions, ordered by sample index then step.
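The autoregressive loop described for get_probs_preds_from_dict can be sketched in plain Python. This is an illustration only: autoregress_sketch and toy_model are hypothetical names, plain lists stand in for tensors, and a single feature column is assumed.

```python
def autoregress_sketch(sequences, predict_next, total_steps):
    """Shift-and-append loop over a batch of equally long sequences.

    predict_next is a hypothetical callable returning one prediction per sample.
    """
    window = [list(seq) for seq in sequences]
    preds_by_step = []
    for _ in range(total_steps):
        step_preds = predict_next(window)
        preds_by_step.append(step_preds)
        # Drop the oldest item and append the newest prediction.
        window = [seq[1:] + [pred] for seq, pred in zip(window, step_preds)]
    # Flatten, ordered by sample index first and by step second.
    return [
        preds_by_step[step][sample]
        for sample in range(len(sequences))
        for step in range(total_steps)
    ]


def toy_model(batch):
    """Toy predictor: the next value is the last value plus one."""
    return [seq[-1] + 1 for seq in batch]


flat = autoregress_sketch([[1, 2, 3], [10, 11, 12]], toy_model, total_steps=3)
```

With total_steps=1 the loop runs once and reduces to ordinary, non-autoregressive inference.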
- sequifier.infer.infer(args: Any, args_config: dict[str, Any]) None[source]¶
Runs the main inference pipeline.
This function orchestrates the inference process. It loads the main inference configuration, retrieves necessary metadata like ID maps and column statistics from a metadata_config file (if required for mapping or normalization), and then delegates the core work to the infer_worker function.
- Parameters:
args – Command-line arguments, typically from argparse. Expected to have attributes like config_path and skip_metadata.
args_config – A dictionary of configuration overrides, often passed from the command line, that will be merged into the loaded configuration file.
- sequifier.infer.infer_embedding(config: InfererModel, inferer: Inferer, model_id: str, dataset: list[Any] | Iterator[Any], column_types: dict[str, torch.dtype]) None[source]¶
Performs inference with an embedding model and saves the results.
This function iterates through the provided dataset (which can be a list of DataFrames or an iterator of tensors). For each data chunk, it calls the appropriate function (get_embeddings or get_embeddings_pt) to generate embeddings. It then formats these embeddings into a Polars DataFrame, associating them with their sequenceId, subsequenceId, and absolute itemPosition, and writes the resulting DataFrame to the configured output path.
- Parameters:
config – The InfererModel configuration object.
inferer – The initialized Inferer instance.
model_id – A string identifier for the model, used for naming output files.
dataset – A list containing a Polars DataFrame (for parquet/csv) or an iterator of loaded PyTorch data (for .pt files).
column_types – A dictionary mapping column names to their torch.dtype.
- sequifier.infer.infer_generative(config: InfererModel, inferer: Inferer, model_id: str, dataset: list[Any] | Iterator[Any], column_types: dict[str, torch.dtype])[source]¶
Executes the generative inference pipeline and exports results to disk.
This function processes the input dataset in chunks to accommodate large data volumes. It handles various input formats (standalone CSV/Parquet, folder-based Parquet, or PyTorch tensors) and routes the data to the appropriate inference logic (standard sequence prediction or step-by-step autoregression). After obtaining raw model outputs, it calculates aligned sequence IDs and absolute item positions, applies necessary post-processing (such as reverse-mapping categorical IDs and denormalizing real values), and writes the final probabilities and predictions to the configured output directory.
- Parameters:
config – The inference configuration object dictating I/O paths, autoregression settings, and output formats.
inferer – The initialized Inferer instance responsible for executing the underlying model logic.
model_id – A string identifier for the current model, used to construct the names of the generated output files and directories.
dataset – A list or iterator yielding data chunks, typically containing either Polars DataFrames or PyTorch tensor dictionaries.
column_types – A dictionary mapping input column names to their expected torch.dtype.
- sequifier.infer.infer_worker(config: Any, args_config: dict[str, Any], id_maps: dict[str, dict[str | int, int]] | None, selected_columns_statistics: dict[str, dict[str, float]], percentage_limits: tuple[float, float] | None)[source]¶
Core worker function that performs inference.
This function handles the main workflow:
1. Loads the dataset based on config.read_format (parquet, csv, or pt).
2. Iterates over one or more model paths specified in the config.
3. For each model, initializes an Inferer object with all necessary configurations, mappings, and statistics.
4. Calls the appropriate inference function (infer_generative or infer_embedding) based on the config.model_type.
5. Manages the data iterators and passes data chunks to the inference functions.
- Parameters:
config – The fully resolved InfererModel configuration object.
args_config – A dictionary of command-line arguments, passed to the Inferer for potential model loading overrides.
id_maps – A nested dictionary mapping categorical column names to their value-to-index maps. None if map_to_id is False.
selected_columns_statistics – A nested dictionary containing ‘mean’ and ‘std’ for real-valued columns used for normalization.
percentage_limits – A tuple (start_pct, end_pct) used only when config.read_format == “pt” to slice the dataset.
- sequifier.infer.load_parquet_folder_dataset(data_path: str, start_pct: float, end_pct: float) Iterator[Any][source]¶
Lazily loads and yields data from long-format .parquet chunk files in a directory.
This function scans a directory for .parquet files, sorts them, and then yields the contents of a specific slice of those files defined by a start and end percentage. This allows for processing large datasets in chunks without loading everything into memory.
- Parameters:
data_path – The path to the folder containing the .parquet files.
start_pct – The starting percentage (0.0 to 100.0) of the file list to begin loading from.
end_pct – The ending percentage (0.0 to 100.0) of the file list to stop loading at.
- Yields:
Iterator – An iterator where each item is a Polars DataFrame loaded from a single .parquet file.
- sequifier.infer.load_pt_dataset(data_path: str, start_pct: float, end_pct: float) Iterator[Any][source]¶
Lazily loads and yields data from .pt files in a directory.
This function scans a directory for .pt files, sorts them, and then yields the contents of a specific slice of those files defined by a start and end percentage. This allows for processing large datasets in chunks without loading everything into memory.
- Parameters:
data_path – The path to the folder containing the .pt files.
start_pct – The starting percentage (0.0 to 100.0) of the file list to begin loading from.
end_pct – The ending percentage (0.0 to 100.0) of the file list to stop loading at.
- Yields:
Iterator – An iterator where each item is the data loaded from a single .pt file (e.g., using torch.load).
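The percentage-based slicing shared by load_parquet_folder_dataset and load_pt_dataset might look roughly like the sketch below. The name iter_file_slice, the load callback, and the rounding of percentages to list indices are all assumptions; the library may compute the boundaries differently.

```python
def iter_file_slice(paths, start_pct, end_pct, load):
    """Lazily yield load(path) for the sorted files between two percentages.

    `load` is a hypothetical callback (e.g. a wrapper around torch.load).
    """
    ordered = sorted(paths)
    # Assumed rounding rule: truncate towards zero.
    start = int(len(ordered) * start_pct / 100.0)
    end = int(len(ordered) * end_pct / 100.0)
    for path in ordered[start:end]:
        # Only one file's contents are materialised at a time.
        yield load(path)


files = ["b.pt", "a.pt", "d.pt", "c.pt"]
first_half = list(iter_file_slice(files, 0.0, 50.0, load=str.upper))
second_half = list(iter_file_slice(files, 50.0, 100.0, load=str.upper))
```

Because the function is a generator, nothing is loaded until the caller starts iterating, which keeps memory usage bounded by the size of a single file.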
- sequifier.infer.normalize(outs: dict[str, numpy.ndarray]) dict[str, numpy.ndarray][source]¶
Applies the softmax function to a dictionary of logits.
Converts raw model logits for categorical columns into probabilities that sum to 1.
- Parameters:
outs – A dictionary mapping target column names to NumPy arrays of logits.
- Returns:
A dictionary mapping the same target column names to NumPy arrays of probabilities.
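As an illustration of the softmax step, the following pure-Python sketch works on nested lists instead of NumPy arrays. The names softmax_row and normalize_sketch are hypothetical, and subtracting the row maximum is a standard stability trick rather than something the source confirms.

```python
import math


def softmax_row(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def normalize_sketch(outs):
    """Apply softmax row by row to every target column in the dictionary."""
    return {column: [softmax_row(row) for row in rows] for column, rows in outs.items()}


probs = normalize_sketch({"category": [[0.0, 0.0], [1.0, 2.0, 3.0]]})
```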
- sequifier.infer.sample_with_cumsum(probs: ndarray, is_log_probs: bool = True) ndarray[source]¶
Samples from a probability distribution using the inverse CDF method.
Takes an array of logits or normalized probabilities, computes the cumulative probability distribution, draws a random number r from [0, 1), and returns the index of the first class i where cumsum[i] > r.
- Parameters:
probs – A 2D NumPy array of logits or normalized probabilities. Shape is (batch_size, num_classes).
is_log_probs – Boolean flag indicating whether the passed array contains logits or normalized probabilities.
- Returns:
A 1D NumPy array of shape (batch_size,) containing the sampled class indices.
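Inverse-CDF sampling can be sketched with the standard library alone. The sketch below assumes already-normalized probabilities and uses hypothetical names (sample_index, sample_batch); the logit-to-probability conversion is omitted.

```python
import bisect
import itertools
import random


def sample_index(probs, rng):
    """Sample one class index from a row of normalized probabilities."""
    cumsum = list(itertools.accumulate(probs))
    r = rng.random()  # uniform draw from [0, 1)
    # bisect_right gives the first position whose cumulative value exceeds r.
    index = bisect.bisect_right(cumsum, r)
    # Guard against round-off leaving cumsum[-1] slightly below 1.
    return min(index, len(probs) - 1)


def sample_batch(batch_probs, seed=0):
    """Sample one index per row, using a seeded generator for reproducibility."""
    rng = random.Random(seed)
    return [sample_index(row, rng) for row in batch_probs]


samples = sample_batch([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.2, 0.3, 0.5]])
```

Rows that put all probability mass on one class always return that class, which makes the behaviour easy to check.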
- sequifier.infer.verify_variable_order(data: DataFrame) None[source]¶
Verifies that the DataFrame is correctly sorted for autoregression.
Checks two conditions:
1. sequenceId is globally sorted in ascending order.
2. subsequenceId is sorted in ascending order within each sequenceId group.
- Parameters:
data – The Polars DataFrame to check.
- Raises:
AssertionError – If sequenceId is not globally sorted or if subsequenceId is not sorted within sequenceId groups.
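The two ordering conditions can be expressed compactly on a list of (sequenceId, subsequenceId) pairs. This is a simplified sketch with a hypothetical name; it returns a boolean instead of raising AssertionError and does not use Polars.

```python
def is_sorted_for_autoregression(rows):
    """Check ordering on a list of (sequence_id, subsequence_id) pairs."""
    for (seq_a, sub_a), (seq_b, sub_b) in zip(rows, rows[1:]):
        # Condition 1: sequence IDs never decrease from one row to the next.
        if seq_a > seq_b:
            return False
        # Condition 2: within one sequence, subsequence IDs never decrease.
        if seq_a == seq_b and sub_a > sub_b:
            return False
    return True


ordered = is_sorted_for_autoregression([(1, 0), (1, 1), (2, 0), (2, 1)])
bad_sequence = is_sorted_for_autoregression([(2, 0), (1, 0)])
bad_subsequence = is_sorted_for_autoregression([(1, 1), (1, 0), (2, 0)])
```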
- sequifier.make.make(args)[source]¶
Creates a new sequifier project.
- Parameters:
args – The command-line arguments.
- sequifier.hyperparameter_search.hyperparameter_search(config_path: str, skip_metadata: bool) None[source]¶
Main function for initiating an Optuna-based hyperparameter search process.
This function loads the configuration, initializes the Optuna study with a minimization direction, and kicks off the optimization loop. Once the configured number of trials is complete, it prints out the best trial’s value and hyperparameters.
- Parameters:
config_path (str) – Path to the hyperparameter search YAML configuration file.
skip_metadata (bool) – Flag indicating whether to skip loading/processing data metadata.
- Raises:
ValueError – If n_trials is not defined in the configuration.
- sequifier.hyperparameter_search.objective(trial: Trial, config) float | tuple[float, ...][source]¶
The central objective engine bridging Optuna to pure CLI execution.
This function handles generating the YAML configuration for the specific trial, dynamically allocating a port for distributed training, launching the training subprocess, asynchronously polling the validation metrics, and reporting them back to Optuna for potential pruning.
- Parameters:
trial (optuna.Trial) – The Optuna trial object managing the current hyperparameter combination.
config (HyperparameterSearchConfig) – The parsed hyperparameter search configuration.
- Returns:
The best validation loss achieved during the trial.
- Return type:
float
- Raises:
optuna.TrialPruned – If the trial is pruned by the Optuna orchestrator.
RuntimeError – If the training subprocess fails or is externally preempted.
- sequifier.hyperparameter_search.set_pdeathsig()[source]¶
Binds child process lifecycle to the parent orchestrator via Linux prctl.
- sequifier.helpers.configure_determinism(seed: int, strict: bool = False) None[source]¶
Enforces deterministic execution for reproducibility.
- sequifier.helpers.configure_logger(project_root: str, model_name: str, rank: int | None = 0)[source]¶
Configures Loguru to replicate the legacy LogFile behavior.
Legacy Behavior Mapping:
1. Console: Only Rank 0 prints high-level info.
2. File 2 (Detailed): Captures ALL logs (equivalent to old level 2).
3. File 3 (Summary): Captures only HIGH importance logs (equivalent to old level 3).
4. Formatting: Files contain raw messages only (no timestamp prefix).
- sequifier.helpers.construct_index_maps(id_maps: dict[str, dict[Union[str, int], int]] | None, target_columns_index_map: list[str], map_to_id: bool | None) dict[str, dict[int, Union[str, int]]][source]¶
Constructs reverse index maps (int index to original ID).
This function creates reverse mappings from the integer indices back to the original string or integer identifiers. It only performs this operation if map_to_id is True and id_maps is provided.
Special mappings for indices 0 and 1 are added:
If original IDs are strings, 0 maps to “unknown” and 1 maps to “other”.
If original IDs are integers, 0 maps to (minimum original ID) - 2 and 1 maps to (minimum original ID) - 1.
- Parameters:
id_maps – A nested dictionary mapping column names to their respective ID-to-index maps (e.g., {‘col_name’: {‘original_id’: 1, …}}). Expected to be provided if map_to_id is True.
target_columns_index_map – A list of column names for which to construct the reverse maps.
map_to_id – A boolean flag. If True, the reverse maps are constructed. If False or None, an empty dictionary is returned.
- Returns:
A dictionary where keys are column names from target_columns_index_map and values are the reverse maps (index-to-original-ID). Returns an empty dict if map_to_id is not True.
- Raises:
AssertionError – If map_to_id is True but id_maps is None.
AssertionError – If the values of a map are not consistently string or integer (excluding the added ‘0’ key).
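The reverse mapping for a single column, including the special entries for indices 0 and 1, can be sketched as follows. The name reverse_map_sketch is hypothetical, and the sketch assumes that real IDs are assigned indices starting at 2.

```python
def reverse_map_sketch(id_map):
    """Invert an original-ID -> index map and add the special indices 0 and 1."""
    reverse = {index: original for original, index in id_map.items()}
    originals = list(id_map)
    if all(isinstance(value, str) for value in originals):
        reverse[0] = "unknown"
        reverse[1] = "other"
    elif all(isinstance(value, int) for value in originals):
        lowest = min(originals)
        reverse[0] = lowest - 2
        reverse[1] = lowest - 1
    else:
        raise AssertionError("original IDs must be all strings or all integers")
    return reverse


string_map = reverse_map_sketch({"apple": 2, "pear": 3})
integer_map = reverse_map_sketch({10: 2, 20: 3})
```

Choosing values below the minimum original ID for the integer case guarantees that the placeholders cannot collide with a real ID.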
- sequifier.helpers.get_best_model_path(project_root: str, run_name: str, model_type: str) tuple[str, int][source]¶
Searches for the exported ‘best’ model file for a given run and returns its path and epoch.
- Parameters:
project_root – The root directory of the project.
run_name – The unique identifier for the hyperparameter search run.
model_type – The extension of the exported model (e.g., ‘onnx’ or ‘pt’).
- Returns:
A tuple containing:
The file path to the best model (str).
The actual epoch at which this model was saved (int).
- Raises:
FileNotFoundError – If no matching model files are found.
- sequifier.helpers.get_last_training_batch_timedelta(model_name: str, rank: int, project_root: str = '.') float[source]¶
Reads the level 2 log file, finds the last two mid-epoch training logs, and returns the timedelta between them in seconds.
- sequifier.helpers.get_torch_dtype(dtype_str: str) dtype[source]¶
Converts a string to a torch dtype, supporting bfloat16 and fp8.
- sequifier.helpers.normalize_path(path: str, project_root: str) str[source]¶
Normalizes a path to be relative to a project path, then joins them.
This function ensures that a given path is correctly expressed as an absolute path rooted at project_root. It does this by first removing the project_root prefix from path (if it exists) and then joining the result back to project_root.
This is useful for handling paths that might be provided as either relative (e.g., “data/file.txt”) or absolute (e.g., “/abs/path/to/project/data/file.txt”).
- Parameters:
path – The path to normalize.
project_root – The absolute path to the project’s root directory.
- Returns:
A normalized, absolute path.
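The strip-then-join behaviour can be sketched as below. The name normalize_path_sketch is hypothetical, posixpath is used so the result is the same on every platform, and the exact prefix check is an assumption about the implementation.

```python
import posixpath


def normalize_path_sketch(path, project_root):
    """Remove a leading project_root from path (if present), then join them."""
    root = project_root.rstrip("/")
    if path == root or path.startswith(root + "/"):
        path = path[len(root):]
    # Strip the leading slash so join does not discard the root.
    return posixpath.join(root, path.lstrip("/"))


root = "/abs/path/to/project"
from_relative = normalize_path_sketch("data/file.txt", root)
from_absolute = normalize_path_sketch("/abs/path/to/project/data/file.txt", root)
```

Both call styles resolve to the same location, so callers do not need to know which form a configuration file used.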
- sequifier.helpers.numpy_to_pytorch(data: DataFrame, column_types: dict[str, torch.dtype], all_columns: list[str], seq_length: int) dict[str, torch.Tensor][source]¶
Converts a long-format Polars DataFrame to a dict of sequence tensors.
This function assumes the input DataFrame data is in a long format where each row represents a sequence for a specific feature. It expects a column named “inputCol” that contains the feature name (e.g., ‘price’, ‘volume’) and other columns representing time steps (e.g., “0”, “1”, …, “L”).
It generates two tensors for each column in all_columns:
1. An “input” tensor (from time steps L down to 1).
2. A “target” tensor (from time steps L-1 down to 0).
Example
For seq_length = 3 and all_columns = [‘price’], it will create:
‘price’: Tensor from columns [“3”, “2”, “1”]
‘price_target’: Tensor from columns [“2”, “1”, “0”]
- Parameters:
data – The long-format Polars DataFrame. Must contain “inputCol” and columns named as strings of integers for time steps.
column_types – A dictionary mapping feature names (from “inputCol”) to their desired torch.dtype.
all_columns – A list of all feature names (from “inputCol”) to be processed and converted into tensors.
seq_length – The total sequence length (L). This determines the column names for time steps (e.g., “0” to “L”).
- Returns:
A dictionary mapping feature names to their corresponding PyTorch tensors. Target tensors are stored with a _target suffix (e.g., {‘price’: <tensor>, ‘price_target’: <tensor>}).
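The input/target column selection can be illustrated on a single row without Polars or PyTorch. In this sketch split_row is a hypothetical helper and a plain dict stands in for one DataFrame row.

```python
def split_row(row, seq_length):
    """Split one long-format row into input and target value lists."""
    # Inputs use time steps L down to 1; targets use L-1 down to 0.
    input_cols = [str(step) for step in range(seq_length, 0, -1)]
    target_cols = [str(step) for step in range(seq_length - 1, -1, -1)]
    return [row[col] for col in input_cols], [row[col] for col in target_cols]


row = {"inputCol": "price", "3": 10.0, "2": 11.0, "1": 12.0, "0": 13.0}
inputs, targets = split_row(row, seq_length=3)
```

The target list is the input list shifted by one time step, which is the usual setup for next-item prediction.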
- sequifier.helpers.read_data(path: str, read_format: str, columns: list[str] | None = None) DataFrame[source]¶
Reads data from a CSV or Parquet file into a Polars DataFrame.
- Parameters:
path – The file path to read from.
read_format – The format of the file. Supported formats are “csv” and “parquet”.
columns – An optional list of column names to read. This argument is only used when read_format is “parquet”.
- Returns:
A Polars DataFrame containing the data from the file.
- Raises:
ValueError – If read_format is not “csv” or “parquet”.
- sequifier.helpers.subset_to_input_columns(data: DataFrame | LazyFrame, input_columns: list[str]) DataFrame | LazyFrame[source]¶
Filters a DataFrame to the rows whose ‘inputCol’ value is in input_columns.
This function supports both Polars (DataFrame, LazyFrame) and Pandas DataFrames, dispatching to the appropriate filtering method.
For Polars objects, it uses data.filter(pl.col(“inputCol”).is_in(…)).
For other objects (presumably Pandas), it builds a numpy boolean mask and filters using data.loc[…].
Note: The type hint only specifies Polars objects, but the implementation includes a fallback path for Pandas-like objects.
- Parameters:
data – The Polars (DataFrame, LazyFrame) or Pandas DataFrame to filter. It must contain a column named “inputCol”.
input_columns – A list of values. Rows will be kept if their value in “inputCol” is present in this list.
- Returns:
A filtered DataFrame or LazyFrame of the same type as the input.
- sequifier.helpers.write_data(data: DataFrame, path: str, write_format: str, **kwargs) None[source]¶
Writes a Polars (or Pandas) DataFrame to a CSV or Parquet file.
This function detects the type of the input DataFrame:
For Polars DataFrames, it uses .write_csv() or .write_parquet().
For other DataFrame types (presumably Pandas), it uses .to_csv() or .to_parquet().
Note: The type hint specifies pl.DataFrame, but the implementation includes a fallback path that suggests compatibility with Pandas DataFrames.
- Parameters:
data – The Polars (or Pandas) DataFrame to write.
path – The destination file path.
write_format – The format to write. Supported formats are “csv” and “parquet”.
**kwargs – Additional keyword arguments passed to the underlying write function (e.g., write_csv for Polars, to_csv for Pandas).
- Returns:
None.
- Raises:
ValueError – If write_format is not “csv” or “parquet”.
- class sequifier.io.yaml.TrainModelDumper(stream, default_style=None, default_flow_style=False, canonical=None, indent=None, width=None, allow_unicode=None, line_break=None, encoding=None, explicit_start=None, explicit_end=None, version=None, tags=None, sort_keys=True)[source]¶
A custom YAML dumper for TrainModel objects.
This dumper extends the base yaml.Dumper to provide custom serialization for TrainModel and related objects, ensuring a clean and readable YAML output. It also modifies the indentation behavior for better formatting.
- increase_indent(flow=False, indentless=False)[source]¶
Increase the indentation level for the YAML output.
This method overrides the default behavior to force indentation for all block-style collections, improving the readability of the output YAML.
- Parameters:
flow – Whether the context is a flow-style collection.
indentless – Whether the context is an indentless sequence.
- Returns:
The result of the parent class’s increase_indent method, with flow forced to False.
- sequifier.io.yaml.represent_dot_dict(dumper, data)[source]¶
Represents DotDict objects as a simple YAML mapping, avoiding the ‘dictitems’ attribute that appeared in the original output. This works as long as the DotDict is essentially a dictionary.
- sequifier.io.yaml.represent_numpy_float(dumper, data)[source]¶
Represents numpy.float64 (and similar numpy floats) as standard YAML floats.
- sequifier.io.yaml.represent_numpy_int(dumper, data)[source]¶
Represents numpy.int64 (and similar numpy integers) as standard YAML integers.
- sequifier.io.yaml.represent_sequifier_object(dumper, data)[source]¶
Represents objects from ‘sequifier.config.train_config’ (like TrainModel, ModelSpecModel, TrainingSpecModel) as a simple YAML mapping, using the object’s __dict__. This effectively removes the !!python/object tag and the explicit ‘__dict__:’, ‘__fields_set__:’ keys.
- class sequifier.io.sequifier_dataset_from_file.SequifierDatasetFromFile(data_path: str, config: TrainModel, shuffle: bool = True)[source]¶
An iterable-style dataset that pre-loads all data into CPU RAM and yields pre-collated batches.
This is the idiomatic PyTorch solution for implementing custom ‘en bloc’ batching. The __iter__ method handles shuffling and batch slicing, ensuring maximum performance.
- __iter__() Iterator[Tuple[Dict[str, Tensor], Dict[str, Tensor], None, None, None]][source]¶
Yields batches of data.
Handles shuffling (if enabled) and slicing data based on distributed rank and worker ID.
- Yields:
Iterator[Tuple[Dict[str, torch.Tensor], Dict[str, torch.Tensor], None, None, None]] – An iterator where each item is a tuple containing:
data_batch (dict): Dictionary of feature tensors for the batch.
targets_batch (dict): Dictionary of target tensors for the batch.
None: Placeholder for sequence_id (not used in this dataset type).
None: Placeholder for subsequence_id (not used in this dataset type).
None: Placeholder for start_position (not used in this dataset type).
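The shape of the yielded tuples can be illustrated with a simplified, single-process sketch. The name iter_batches_sketch is hypothetical, lists stand in for tensors, and the slicing by distributed rank and worker ID mentioned above is deliberately left out.

```python
import random


def iter_batches_sketch(features, targets, batch_size, shuffle=True, seed=0):
    """Yield pre-collated (data, targets, None, None, None) batches from memory."""
    n_rows = len(next(iter(features.values())))
    order = list(range(n_rows))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, n_rows, batch_size):
        idx = order[start:start + batch_size]
        data_batch = {name: [col[i] for i in idx] for name, col in features.items()}
        targets_batch = {name: [col[i] for i in idx] for name, col in targets.items()}
        # The three trailing None values mirror the unused ID/position slots.
        yield data_batch, targets_batch, None, None, None


batches = list(
    iter_batches_sketch(
        {"price": [1, 2, 3, 4, 5]},
        {"price": [2, 3, 4, 5, 6]},
        batch_size=2,
        shuffle=False,
    )
)
```

Slicing whole batches out of in-memory data avoids per-sample collation, which is the main point of the pre-collated design.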