This page contains the auto-generated API reference documentation for the sequifier package.
Preprocessing Config¶
- class sequifier.config.preprocess_config.PreprocessorModel(*, project_path: str, data_path: str, read_format: str = 'csv', write_format: str = 'parquet', combine_into_single_file: bool = True, selected_columns: list[str] | None, group_proportions: list[float], seq_length: int, seq_step_sizes: list[int] | None, max_rows: int | None, seed: int, n_cores: int | None, batches_per_file: int = 1024, process_by_file: bool = True)[source]¶
Pydantic model for preprocessor configuration.
- project_path¶
The path to the sequifier project directory.
- Type:
str
- data_path¶
The path to the input data file.
- Type:
str
- read_format¶
The file type of the input data. Can be ‘csv’ or ‘parquet’.
- Type:
str
- write_format¶
The file type for the preprocessed output data.
- Type:
str
- combine_into_single_file¶
If True, combines all preprocessed data into a single file.
- Type:
bool
- selected_columns¶
A list of columns to be included in the preprocessing. If None, all columns are used.
- Type:
list[str] | None
- group_proportions¶
A list of floats that define the relative sizes of data splits (e.g., for train, validation, test). The sum of proportions must be 1.0.
- Type:
list[float]
- seq_length¶
The sequence length for the model inputs.
- Type:
int
- seq_step_sizes¶
A list of step sizes for creating subsequences within each data split.
- Type:
list[int] | None
- max_rows¶
The maximum number of input rows to process. If None, all rows are processed.
- Type:
int | None
- seed¶
A random seed for reproducibility.
- Type:
int
- n_cores¶
The number of CPU cores to use for parallel processing. If None, all available CPU cores are used.
- Type:
int | None
- batches_per_file¶
The number of batches to process per file.
- Type:
int
- process_by_file¶
A flag to indicate if processing should be done file by file.
- Type:
bool
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.
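A minimal construction sketch for this config in Python. Only the field names and defaults are taken from the signature above; the paths and values are illustrative, and in practice the config is typically loaded from a YAML file.

```python
from sequifier.config.preprocess_config import PreprocessorModel

# Illustrative values; only the field names come from the signature above.
preprocess_config = PreprocessorModel(
    project_path=".",
    data_path="data/input.csv",         # hypothetical input file
    read_format="csv",
    write_format="parquet",
    combine_into_single_file=True,
    selected_columns=None,              # use all columns
    group_proportions=[0.8, 0.1, 0.1],  # train / validation / test, must sum to 1.0
    seq_length=16,
    seq_step_sizes=[1, 16, 16],         # one step size per split
    max_rows=None,
    seed=101,
    n_cores=None,                       # use all available cores
    batches_per_file=1024,
    process_by_file=True,
)
```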
Training Config¶
Inference Config¶
- class sequifier.config.infer_config.InfererModel(*, project_path: str, ddconfig_path: str, model_path: str | list[str], model_type: str, data_path: str, training_config_path: str = 'configs/train.yaml', read_format: str = 'parquet', write_format: str = 'csv', selected_columns: list[str], categorical_columns: list[str], real_columns: list[str], target_columns: list[str], column_types: dict[str, str], target_column_types: dict[str, str], output_probabilities: bool = False, map_to_id: bool = True, seed: int, device: str, seq_length: int, inference_batch_size: int, distributed: bool = False, load_full_data_to_ram: bool = True, world_size: int = 1, num_workers: int = 0, sample_from_distribution_columns: list[str] | None = None, infer_with_dropout: bool = False, autoregression: bool = False, autoregression_extra_steps: int | None = None)[source]¶
Pydantic model for inference configuration.
- project_path¶
The path to the sequifier project directory.
- Type:
str
- ddconfig_path¶
The path to the data-driven configuration file.
- Type:
str
- model_path¶
The path to the trained model file(s).
- Type:
str | list[str]
- model_type¶
The type of model, either ‘embedding’ or ‘generative’.
- Type:
str
- data_path¶
The path to the data to be used for inference.
- Type:
str
- training_config_path¶
The path to the training configuration file.
- Type:
str
- read_format¶
The file format of the input data (e.g., ‘csv’, ‘parquet’).
- Type:
str
- write_format¶
The file format for the inference output.
- Type:
str
- selected_columns¶
The list of input columns used for inference.
- Type:
list[str]
- categorical_columns¶
A list of columns that are categorical.
- Type:
list[str]
- real_columns¶
A list of columns that are real-valued.
- Type:
list[str]
- target_columns¶
The list of target columns for inference.
- Type:
list[str]
- column_types¶
A dictionary mapping each column to its numeric type (‘int64’ or ‘float64’).
- Type:
dict[str, str]
- target_column_types¶
A dictionary mapping target columns to their types (‘categorical’ or ‘real’).
- Type:
dict[str, str]
- output_probabilities¶
If True, outputs the probability distributions for categorical target columns.
- Type:
bool
- map_to_id¶
If True, maps categorical output values back to their original IDs.
- Type:
bool
- seed¶
The random seed for reproducibility.
- Type:
int
- device¶
The device to run inference on (e.g., ‘cuda’, ‘cpu’, ‘mps’).
- Type:
str
- seq_length¶
The sequence length of the model’s input.
- Type:
int
- inference_batch_size¶
The batch size for inference.
- Type:
int
- distributed¶
If True, enables distributed inference.
- Type:
bool
- load_full_data_to_ram¶
If True, loads the entire dataset into RAM.
- Type:
bool
- world_size¶
The number of processes for distributed inference.
- Type:
int
- num_workers¶
The number of worker processes for data loading.
- Type:
int
- sample_from_distribution_columns¶
A list of columns for which predictions are sampled from the predicted probability distribution rather than selected deterministically.
- Type:
list[str] | None
- infer_with_dropout¶
If True, applies dropout during inference.
- Type:
bool
- autoregression¶
If True, performs autoregressive inference.
- Type:
bool
- autoregression_extra_steps¶
The number of additional steps for autoregressive inference.
- Type:
int | None
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}¶
Configuration for the model; should be a dictionary conforming to pydantic's ConfigDict.
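A hedged construction sketch for the inference config. The field names and defaults come from the signature above; the paths and column names are placeholders only.

```python
from sequifier.config.infer_config import InfererModel

# Placeholder paths and column names; field names follow the signature above.
infer_config = InfererModel(
    project_path=".",
    ddconfig_path="configs/ddconfigs/input.json",  # hypothetical
    model_path="models/model.pt",                  # hypothetical
    model_type="generative",
    data_path="data/input-split2.parquet",         # hypothetical
    selected_columns=["itemId"],
    categorical_columns=["itemId"],
    real_columns=[],
    target_columns=["itemId"],
    column_types={"itemId": "int64"},
    target_column_types={"itemId": "categorical"},
    seed=101,
    device="cpu",
    seq_length=16,
    inference_batch_size=32,
)
```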
Hyperparameter Search Config¶
Non-standard Optimizers¶
- class sequifier.optimizers.ademamix.AdEMAMix(params={}, lr=0.001, betas=(0.9, 0.999, 0.9999), eps=1e-08, weight_decay=0, alpha=5.0, T_alpha_beta3=None)[source]¶
Implements the AdEMAMix optimizer.
This optimizer follows the paper “The AdEMAMix Optimizer: Better, Faster, Older”. It extends Adam with a second, slower exponential moving average of past gradients and mixes the two averages (weighted by alpha) in the update, which allows the optimizer to exploit much older gradients. A usage sketch follows the parameter list.
- Parameters:
params (iterable) – Iterable of parameters to optimize or dicts defining parameter groups.
lr (float, optional) – Learning rate (default: 1e-3).
betas (Tuple[float, float, float], optional) – Coefficients for the fast gradient EMA, the squared-gradient EMA, and the slow gradient EMA, respectively (default: (0.9, 0.999, 0.9999)).
eps (float, optional) – Term added to the denominator to improve numerical stability (default: 1e-8).
weight_decay (float, optional) – Weight decay (L2 penalty) (default: 0).
alpha (float, optional) – Mixing coefficient (default: 5.0).
T_alpha_beta3 (int, optional) – Time period for alpha and beta3 scheduling (default: None).
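A usage sketch: AdEMAMix is used like any other torch.optim optimizer; the model below is a stand-in.

```python
import torch
from sequifier.optimizers.ademamix import AdEMAMix

model = torch.nn.Linear(8, 2)  # stand-in model
optimizer = AdEMAMix(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999, 0.9999), weight_decay=0.0, alpha=5.0
)

loss = model(torch.randn(4, 8)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```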
Internals¶
- class sequifier.preprocess.Preprocessor(project_path: str, data_path: str, read_format: str, write_format: str, combine_into_single_file: bool, selected_columns: list[str] | None, group_proportions: list[float], seq_length: int, seq_step_sizes: list[int], max_rows: int | None, seed: int, n_cores: int | None, batches_per_file: int, process_by_file: bool)[source]¶
A class for preprocessing data for the sequifier model.
This class handles loading, preprocessing, and saving data. It supports single-file and multi-file processing, and can handle large datasets by processing them in batches.
- project_path¶
The path to the sequifier project directory.
- Type:
str
- batches_per_file¶
The number of batches to process per file.
- Type:
int
- data_name_root¶
The root name of the data file.
- Type:
str
- combine_into_single_file¶
Whether to combine the output into a single file.
- Type:
bool
- target_dir¶
The target directory for temporary files.
- Type:
str
- seed¶
The random seed for reproducibility.
- Type:
int
- n_cores¶
The number of cores to use for parallel processing.
- Type:
int
- split_paths¶
The paths to the output split files.
- Type:
list[str]
- __init__(project_path: str, data_path: str, read_format: str, write_format: str, combine_into_single_file: bool, selected_columns: list[str] | None, group_proportions: list[float], seq_length: int, seq_step_sizes: list[int], max_rows: int | None, seed: int, n_cores: int | None, batches_per_file: int, process_by_file: bool)[source]¶
Initializes the Preprocessor with the given parameters.
- Parameters:
project_path – The path to the sequifier project directory.
data_path – The path to the input data file.
read_format – The file type of the input data.
write_format – The file type for the preprocessed output data.
combine_into_single_file – Whether to combine the output into a single file.
selected_columns – A list of columns to be included in the preprocessing.
group_proportions – A list of floats that define the relative sizes of data splits.
seq_length – The sequence length for the model inputs.
seq_step_sizes – A list of step sizes for creating subsequences.
max_rows – The maximum number of input rows to process.
seed – A random seed for reproducibility.
n_cores – The number of CPU cores to use for parallel processing.
batches_per_file – The number of batches to process per file.
process_by_file – A flag to indicate if processing should be done file by file.
- sequifier.preprocess.cast_columns_to_string(data: DataFrame) DataFrame[source]¶
Casts the column names of a Polars DataFrame to strings.
This is often necessary because column names may be integers (e.g., 0, 1, 2, …), while some operations require string column names.
- Parameters:
data – The Polars DataFrame.
- Returns:
The same DataFrame with its column names converted to strings.
- sequifier.preprocess.combine_maps(map1: dict[str | int, int], map2: dict[str | int, int]) dict[str | int, int][source]¶
Combines two ID maps into a new, consolidated map.
Takes all unique keys from both map1 and map2, sorts them, and creates a new, single map where keys are mapped to 1-based indices based on the sorted order. This ensures a consistent mapping across different data chunks.
- Parameters:
map1 – The first ID map.
map2 – The second ID map.
- Returns:
A new, combined, and re-indexed ID map.
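A minimal sketch of the behaviour described above (not the library source):

```python
def combine_maps_sketch(map1, map2):
    # Union of the keys from both maps, sorted, then re-indexed from 1.
    keys = sorted(set(map1) | set(map2))
    return {key: index for index, key in enumerate(keys, start=1)}

combine_maps_sketch({"a": 1, "c": 2}, {"b": 1})
# -> {"a": 1, "b": 2, "c": 3}
```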
- sequifier.preprocess.combine_multiprocessing_outputs(project_path: str, target_dir: str, n_splits: int, input_files: dict[int, list[str]], dataset_name: str, write_format: str, in_target_dir: bool = False, pre_split_str: str | None = None, post_split_str: str | None = None) None[source]¶
Combines multiple intermediate batch files into final split files.
This function iterates through each split and combines all the intermediate files listed in input_files[split] into a single final output file for that split.
For “csv” format, it uses the csvstack command-line utility.
For “parquet” format, it uses pyarrow.parquet.ParquetWriter to concatenate the files efficiently.
- Parameters:
project_path – The path to the sequifier project directory.
target_dir – The temporary directory containing intermediate files.
n_splits – The number of data splits.
input_files – A dictionary mapping split index (int) to a list of input file paths (str) for that split.
dataset_name – The root name for the final output files.
write_format – The file format (“csv” or “parquet”).
in_target_dir – If True, the final combined file is written inside target_dir. If False, it’s written to data/.
pre_split_str – An optional string to insert into the filename before the “-split{i}” part.
post_split_str – An optional string to insert into the filename after the “-split{i}” part.
- sequifier.preprocess.combine_parquet_files(files: list[str], out_path: str) None[source]¶
Combines multiple Parquet files into a single Parquet file.
This function reads the schema from the first file and uses it to initialize a ParquetWriter. It then iterates through all files in the list, reading each one as a table and writing it to the new combined file. This is more memory-efficient than reading all files into one large table first.
- Parameters:
files – A list of paths to the Parquet files to combine.
out_path – The path for the combined output Parquet file.
- sequifier.preprocess.create_file_paths_for_multiple_files1(project_path: str, target_dir: str, n_splits: int, n_batches: int, process_id: int, file_index: int, dataset_name: str, write_format: str) dict[int, list[str]][source]¶
Creates a dictionary of temporary file paths for a specific data file.
This is used in the multi-file, combine_into_single_file=True workflow. It generates file path names for intermediate batches before they are combined.
The naming pattern is: {dataset_name}-{process_id}-{file_index}-split{split}-{batch_id}.{write_format}
- Parameters:
project_path – The path to the sequifier project directory.
target_dir – The temporary directory to place files in.
n_splits – The number of data splits.
n_batches – The number of batches created by the process.
process_id – The ID of the multiprocessing worker.
file_index – The index of the file being processed by this worker.
dataset_name – The root name of the dataset.
write_format – The file extension.
- Returns:
A dictionary mapping a split index (int) to a list of file paths (str) for all batches in that split.
- sequifier.preprocess.create_file_paths_for_multiple_files2(project_path: str, target_dir: str, n_splits: int, n_processes: int, n_files: dict[int, int], dataset_name: str, write_format: str) dict[int, list[str]][source]¶
Creates a dictionary of intermediate file paths for a multi-file run.
This is used in the multi-file, combine_into_single_file=True workflow. It generates the file paths for the combined files from each process, which are the inputs to the final combination step.
The naming pattern is: {dataset_name}-{process_id}-{file_index}-split{split}.{write_format}
- Parameters:
project_path – The path to the sequifier project directory.
target_dir – The temporary directory where files are located.
n_splits – The number of data splits.
n_processes – The total number of multiprocessing workers.
n_files – A dictionary mapping process_id to the number of files that process handled.
dataset_name – The root name of the dataset.
write_format – The file extension.
- Returns:
A dictionary mapping a split index (int) to a list of all intermediate combined file paths (str) for that split.
- sequifier.preprocess.create_file_paths_for_single_file(project_path: str, target_dir: str, n_splits: int, n_batches: int, dataset_name: str, write_format: str) dict[int, list[str]][source]¶
Creates a dictionary of temporary file paths for a single-file run.
This is used in the single-file, combine_into_single_file=True workflow. It generates file path names for intermediate batches created by different processes before they are combined.
The naming pattern is: {dataset_name}-split{split}-{core_id}.{write_format}
- Parameters:
project_path – The path to the sequifier project directory.
target_dir – The temporary directory to place files in.
n_splits – The number of data splits.
n_batches – The number of processes (batches) running in parallel.
dataset_name – The root name of the dataset.
write_format – The file extension.
- Returns:
A dictionary mapping a split index (int) to a list of file paths (str) for all batches in that split.
- sequifier.preprocess.create_id_map(data: DataFrame, column: str) dict[str | int, int][source]¶
Creates a map from unique values in a column to integer indices.
Finds all unique values in the specified column of the data DataFrame, sorts them, and creates a dictionary mapping each unique value to a 1-based integer index.
- Parameters:
data – The Polars DataFrame containing the column.
column – The name of the column to map.
- Returns:
A dictionary mapping unique values (str or int) to an integer index (starting from 1).
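An illustrative call on a hypothetical 'itemId' column:

```python
import polars as pl
from sequifier.preprocess import create_id_map

df = pl.DataFrame({"itemId": ["b", "a", "b", "c"]})
create_id_map(df, "itemId")
# Expected, per the description above: {"a": 1, "b": 2, "c": 3}
```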
- sequifier.preprocess.delete_files(files: list[str] | dict[int, list[str]]) None[source]¶
Deletes a list of files from the filesystem.
- Parameters:
files – A list of file paths to delete, or a dictionary whose values are lists of file paths to delete.
- sequifier.preprocess.extract_sequences(data: DataFrame, schema: Any, seq_length: int, seq_step_size: int, columns: list[str]) DataFrame[source]¶
Extracts subsequences from a DataFrame of full sequences.
This function takes a DataFrame where each row contains all items for a single sequenceId. It iterates through each sequenceId, extracts all possible subsequences of seq_length using the specified seq_step_size, calculates the starting position of each subsequence within the original sequence, and formats them into a new, long-format DataFrame that conforms to the provided schema.
- Parameters:
data – The input Polars DataFrame, grouped by “sequenceId”.
schema – The schema for the output long-format DataFrame.
seq_length – The length of the subsequences to extract.
seq_step_size – The step size to use when sliding the window to create subsequences.
columns – A list of the data column names (features) to extract.
- Returns:
A new, long-format Polars DataFrame containing the extracted subsequences, matching the provided schema. Includes columns for sequenceId, subsequenceId, startItemPosition, inputCol, and the sequence items (‘0’, ‘1’, …).
- sequifier.preprocess.extract_subsequences(in_seq: dict[str, list], seq_length: int, seq_step_size: int, columns: list[str]) dict[str, list[list[float | int]]][source]¶
Extracts subsequences from a dictionary of sequence lists.
This function takes a dictionary in_seq where keys are column names and values are lists of items for a single full sequence. It first pads the sequences with 0s at the beginning if they are shorter than seq_length. Then, it calculates the subsequence start indices using get_subsequence_starts and extracts all subsequences.
- Parameters:
in_seq – A dictionary mapping column names to lists of items (e.g., {‘col_A’: [1, 2, 3, 4, 5], ‘col_B’: [6, 7, 8, 9, 10]}).
seq_length – The length of the subsequences to extract.
seq_step_size – The desired step size between subsequences.
columns – A list of the column names (keys in in_seq) to process.
- Returns:
A dictionary mapping column names to a list of lists, where each inner list is a subsequence.
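Working through the docstring's own example input; the output shown is what the description above implies, not verified against the source:

```python
from sequifier.preprocess import extract_subsequences

extract_subsequences(
    {"col_A": [1, 2, 3, 4, 5], "col_B": [6, 7, 8, 9, 10]},
    seq_length=3,
    seq_step_size=2,
    columns=["col_A", "col_B"],
)
# Expected: {"col_A": [[1, 2, 3], [3, 4, 5]], "col_B": [[6, 7, 8], [8, 9, 10]]}
```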
- sequifier.preprocess.get_batch_limits(data: DataFrame, n_batches: int) list[tuple[int, int]][source]¶
Calculates row indices to split a DataFrame into batches.
This function divides the DataFrame into n_batches roughly equal chunks. Crucially, it ensures that no sequenceId is split across two different batches. It does this by finding the ideal split points and then adjusting them to the nearest sequenceId boundary.
- Parameters:
data – The DataFrame to split. Must be sorted by “sequenceId”.
n_batches – The desired number of batches.
- Returns:
A list of (start_index, end_index) tuples, where each tuple defines the row indices for a batch.
- sequifier.preprocess.get_combined_statistics(n1: int, mean1: float, std1: float, n2: int, mean2: float, std2: float) tuple[float, float][source]¶
Calculates the combined mean and standard deviation of two data subsets.
Uses a stable parallel algorithm (related to Welford’s algorithm) to combine statistics from two subsets without needing the original data.
- Parameters:
n1 – Number of samples in subset 1.
mean1 – Mean of subset 1.
std1 – Standard deviation of subset 1.
n2 – Number of samples in subset 2.
mean2 – Mean of subset 2.
std2 – Standard deviation of subset 2.
- Returns:
A tuple (combined_mean, combined_std) containing the combined mean and standard deviation of the two subsets.
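A standard formulation of the pooled statistics described above, as a sketch (the library's exact handling of degrees of freedom may differ):

```python
def combined_stats_sketch(n1, mean1, std1, n2, mean2, std2):
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    # Pooled variance: each subset's variance plus the squared shift of its
    # mean relative to the combined mean, weighted by subset size.
    var = (n1 * (std1**2 + (mean1 - mean) ** 2)
           + n2 * (std2**2 + (mean2 - mean) ** 2)) / n
    return mean, var**0.5
```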
- sequifier.preprocess.get_group_bounds(data_subset: DataFrame, group_proportions: list[float])[source]¶
Calculates row indices for splitting a sequence into groups.
This function takes a DataFrame data_subset (which typically contains all items for a single sequenceId) and calculates the row indices to split it into multiple groups (e.g., train, val, test) based on the provided group_proportions.
- Parameters:
data_subset – The DataFrame (for a single sequence) to split.
group_proportions – A list of floats (e.g., [0.8, 0.1, 0.1]) that sum to 1.0, defining the relative sizes of the splits.
- Returns:
A list of (start_index, end_index) tuples, one for each proportion, defining the row slices for each group.
- sequifier.preprocess.get_subsequence_starts(in_seq_length: int, seq_length: int, seq_step_size: int) ndarray[source]¶
Calculates the start indices for extracting subsequences.
This function determines the starting indices for sliding a window of seq_length over an input sequence of in_seq_length. It aims to use seq_step_size, but adjusts the step size slightly to ensure that the windows are distributed as evenly as possible and cover the full sequence from the beginning to the end.
- Parameters:
in_seq_length – The length of the original input sequence.
seq_length – The length of the subsequences to extract.
seq_step_size – The desired step size between subsequences.
- Returns:
A numpy array of integer start indices for each subsequence.
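A plausible sketch of the even-spread behaviour described above, assuming window starts run from 0 to in_seq_length - seq_length (not the library source):

```python
import numpy as np

def get_subsequence_starts_sketch(in_seq_length, seq_length, seq_step_size):
    last_start = in_seq_length - seq_length
    # Number of windows implied by the desired step size, then spread evenly
    # so the windows cover the sequence from start to end.
    n_windows = int(np.ceil(last_start / seq_step_size)) + 1 if last_start > 0 else 1
    return np.round(np.linspace(0, last_start, n_windows)).astype(int)

get_subsequence_starts_sketch(10, 4, 3)  # array([0, 3, 6])
```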
- sequifier.preprocess.insert_top_folder(path: str, folder_name: str) str[source]¶
Inserts a directory name into a file path, just before the filename.
Example
insert_top_folder(“a/b/c.txt”, “temp”) returns “a/b/temp/c.txt”
- Parameters:
path – The original file path.
folder_name – The name of the folder to insert.
- Returns:
The new path string with the folder inserted.
- sequifier.preprocess.preprocess(args: Any, args_config: dict[str, Any]) None[source]¶
Runs the main data preprocessing pipeline.
This function loads the preprocessing configuration, initializes the Preprocessor class, and executes the preprocessing steps based on the loaded configuration.
- Parameters:
args – An object containing command-line arguments. Expected to have a config_path attribute specifying the path to the YAML configuration file.
args_config – A dictionary containing additional configuration parameters that may override or supplement the settings loaded from the config file.
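A hedged sketch of invoking the pipeline programmatically; per the docstring, args only needs a config_path attribute, though the real CLI entry point may pass more, and the YAML path is a placeholder.

```python
from types import SimpleNamespace
from sequifier.preprocess import preprocess

preprocess(SimpleNamespace(config_path="configs/preprocess.yaml"), args_config={})
```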
- sequifier.preprocess.preprocess_batch(project_path: str, data_name_root: str, process_id: int, batch: DataFrame, schema: Any, split_paths: list[str], seq_length: int, seq_step_sizes: list[int], data_columns: list[str], col_types: dict[str, str], group_proportions: list[float], target_dir: str, write_format: str, batches_per_file: int) None[source]¶
Processes a batch of data.
- Parameters:
project_path – The path to the sequifier project directory.
data_name_root – The root name of the data file.
process_id – The id of the process.
batch – The batch of data to process.
schema – The schema for the preprocessed data.
split_paths – The paths to the output split files.
seq_length – The sequence length for the model inputs.
seq_step_sizes – A list of step sizes for creating subsequences.
data_columns – A list of data columns.
col_types – A dictionary containing the column types.
group_proportions – A list of floats that define the relative sizes of data splits.
target_dir – The target directory for temporary files.
write_format – The file format for the output files.
batches_per_file – The number of batches to process per file.
- sequifier.preprocess.process_and_write_data_pt(data: DataFrame, seq_length: int, path: str, column_types: dict[str, str])[source]¶
Processes the sequence DataFrame and writes it to a .pt file.
This function takes the long-format sequence DataFrame (data), aggregates it by sequenceId and subsequenceId, and pivots it so that each inputCol becomes its own column containing a list of sequence items. It also extracts the startItemPosition.
It then converts these lists into NumPy arrays, splits them into sequences (all but last item) and targets (all but first item), and converts them to PyTorch tensors along with sequence/subsequence IDs and start positions. The final data tuple (sequences_dict, targets_dict, sequence_ids_tensor, subsequence_ids_tensor, start_item_positions_tensor) is saved to a .pt file using torch.save.
- Parameters:
data – The long-format Polars DataFrame of extracted sequences.
seq_length – The total sequence length (N). The resulting tensors will have sequence length N-1.
path – The output file path (e.g., “data/batch_0.pt”).
column_types – A dictionary mapping column names to their string data types, used to determine the correct torch dtype.
- sequifier.make.make(args)[source]¶
Creates a new sequifier project.
- Parameters:
args – The command-line arguments.
- class sequifier.helpers.LogFile(path: str, rank: int | None = None)[source]¶
Manages logging to multiple files based on verbosity levels.
This class opens multiple log files based on a path template and a hardcoded list of levels (2 and 3). Messages are written to files based on their assigned level, and high-level messages are also printed to the console on the main process (rank 0).
- rank¶
The rank of the current process, used to control console output.
- Type:
Optional[int]
- levels¶
The hardcoded list of log levels [2, 3] for which files are created.
- Type:
list[int]
- _files¶
A dictionary mapping log levels to their open file handlers.
- Type:
dict[int, io.TextIOWrapper]
- _path¶
The original path template provided.
- Type:
str
- __init__(path: str, rank: int | None = None)[source]¶
Initializes the LogFile and opens log files.
The path argument should be a template containing “[NUMBER]”, which will be replaced by the log levels (2 and 3) to create separate log files.
- Parameters:
path – The path template for the log files (e.g., “run_log_[NUMBER].txt”).
rank – The rank of the current process (e.g., in distributed training). If None or 0, high-level messages will be printed to stdout.
- write(string: str, level: int = 3) None[source]¶
Writes a string to log files and potentially the console.
The string is written to all log files whose level is less than or equal to the specified level.
A message with level=2 goes to file 2.
A message with level=3 goes to file 2 and file 3.
If level is 3 or greater, the message is also printed to stdout if self.rank is None or 0.
- Parameters:
string – The message to log.
level – The verbosity level of the message. Defaults to 3.
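A usage sketch following the path-template and level conventions described above:

```python
from sequifier.helpers import LogFile

log = LogFile("run_log_[NUMBER].txt", rank=0)  # opens run_log_2.txt and run_log_3.txt
log.write("detailed diagnostics", level=2)     # written to file 2 only
log.write("epoch 1 finished", level=3)         # written to files 2 and 3, and printed on rank 0
```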
- sequifier.helpers.construct_index_maps(id_maps: dict[str, dict[str | int, int]] | None, target_columns_index_map: list[str], map_to_id: bool | None) dict[str, dict[int, str | int]][source]¶
Constructs reverse index maps (int index to original ID).
This function creates reverse mappings from the integer indices back to the original string or integer identifiers. It only performs this operation if map_to_id is True and id_maps is provided.
A special mapping for index 0 is added:
- If original IDs are strings, 0 maps to “unknown”.
- If original IDs are integers, 0 maps to (minimum original ID) - 1.
- Parameters:
id_maps – A nested dictionary mapping column names to their respective ID-to-index maps (e.g., {‘col_name’: {‘original_id’: 1, …}}). Expected to be provided if map_to_id is True.
target_columns_index_map – A list of column names for which to construct the reverse maps.
map_to_id – A boolean flag. If True, the reverse maps are constructed. If False or None, an empty dictionary is returned.
- Returns:
A dictionary where keys are column names from target_columns_index_map and values are the reverse maps (index-to-original-ID). Returns an empty dict if map_to_id is not True.
- Raises:
AssertionError – If map_to_id is True but id_maps is None.
AssertionError – If the values of a map are not consistently string or integer (excluding the added ‘0’ key).
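An illustrative call with a hypothetical 'itemId' map, showing the reverse mapping and the special index 0 described above:

```python
from sequifier.helpers import construct_index_maps

id_maps = {"itemId": {"a": 1, "b": 2}}
construct_index_maps(id_maps, ["itemId"], map_to_id=True)
# Expected, per the description: {"itemId": {1: "a", 2: "b", 0: "unknown"}}
```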
- sequifier.helpers.normalize_path(path: str, project_path: str) str[source]¶
Normalizes a path to be relative to a project path, then joins them.
This function ensures that a given path is correctly expressed as an absolute path rooted at project_path. It does this by first removing the project_path prefix from path (if it exists) and then joining the result back to project_path.
This is useful for handling paths that might be provided as either relative (e.g., “data/file.txt”) or absolute (e.g., “/abs/path/to/project/data/file.txt”).
- Parameters:
path – The path to normalize.
project_path – The absolute path to the project’s root directory.
- Returns:
A normalized, absolute path.
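A minimal sketch of the described behaviour, assuming simple string-prefix handling (not the library source):

```python
import os

def normalize_path_sketch(path: str, project_path: str) -> str:
    # Strip the project prefix if present, then re-root at project_path.
    if path.startswith(project_path):
        path = path[len(project_path):].lstrip(os.sep)
    return os.path.join(project_path, path)

normalize_path_sketch("data/file.txt", "/abs/path/to/project")
# -> "/abs/path/to/project/data/file.txt"
normalize_path_sketch("/abs/path/to/project/data/file.txt", "/abs/path/to/project")
# -> "/abs/path/to/project/data/file.txt"
```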
- sequifier.helpers.numpy_to_pytorch(data: DataFrame, column_types: dict[str, dtype], all_columns: list[str], seq_length: int) dict[str, Tensor][source]¶
Converts a long-format Polars DataFrame to a dict of sequence tensors.
This function assumes the input DataFrame data is in a long format where each row represents a sequence for a specific feature. It expects a column named “inputCol” that contains the feature name (e.g., ‘price’, ‘volume’) and other columns representing time steps (e.g., “0”, “1”, …, “L”).
It generates two tensors for each column in all_columns:
1. An “input” tensor (from time steps L down to 1).
2. A “target” tensor (from time steps L-1 down to 0).
Example
For seq_length = 3 and all_columns = [‘price’], it will create:
- ‘price’: Tensor from columns [“3”, “2”, “1”]
- ‘price_target’: Tensor from columns [“2”, “1”, “0”]
- Parameters:
data – The long-format Polars DataFrame. Must contain “inputCol” and columns named as strings of integers for time steps.
column_types – A dictionary mapping feature names (from “inputCol”) to their desired torch.dtype.
all_columns – A list of all feature names (from “inputCol”) to be processed and converted into tensors.
seq_length – The total sequence length (L). This determines the column names for time steps (e.g., “0” to “L”).
- Returns:
A dictionary mapping feature names to their corresponding PyTorch tensors. Target tensors are stored with a _target suffix (e.g., {‘price’: <tensor>, ‘price_target’: <tensor>}).
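A hedged usage sketch with a toy long-format frame. The real frames produced by preprocessing carry additional identifier columns (sequenceId, subsequenceId, …) that are omitted here, so treat this as illustrative only.

```python
import polars as pl
import torch
from sequifier.helpers import numpy_to_pytorch

long_df = pl.DataFrame({
    "inputCol": ["price", "price"],
    "3": [1.0, 2.0], "2": [1.1, 2.1], "1": [1.2, 2.2], "0": [1.3, 2.3],
})
tensors = numpy_to_pytorch(long_df, {"price": torch.float64}, ["price"], seq_length=3)
# Expected keys, per the description: "price" (steps 3..1) and "price_target" (steps 2..0)
```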
- sequifier.helpers.read_data(path: str, read_format: str, columns: list[str] | None = None) DataFrame[source]¶
Reads data from a CSV or Parquet file into a Polars DataFrame.
- Parameters:
path – The file path to read from.
read_format – The format of the file. Supported formats are “csv” and “parquet”.
columns – An optional list of column names to read. This argument is only used when read_format is “parquet”.
- Returns:
A Polars DataFrame containing the data from the file.
- Raises:
ValueError – If read_format is not “csv” or “parquet”.
- sequifier.helpers.subset_to_selected_columns(data: DataFrame | LazyFrame, selected_columns: list[str]) DataFrame | LazyFrame[source]¶
Filters a DataFrame to rows where ‘inputCol’ is in a selected list.
This function supports both Polars (DataFrame, LazyFrame) and Pandas DataFrames, dispatching to the appropriate filtering method.
For Polars objects, it uses data.filter(pl.col(“inputCol”).is_in(…)).
For other objects (presumably Pandas), it builds a numpy boolean mask and filters using data.loc[…].
Note: The type hint only specifies Polars objects, but the implementation includes a fallback path for Pandas-like objects.
- Parameters:
data – The Polars (DataFrame, LazyFrame) or Pandas DataFrame to filter. It must contain a column named “inputCol”.
selected_columns – A list of values. Rows will be kept if their value in “inputCol” is present in this list.
- Returns:
A filtered DataFrame or LazyFrame of the same type as the input.
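The equivalent Polars expression for the filter described above, on a toy frame:

```python
import polars as pl

df = pl.DataFrame({"inputCol": ["price", "volume", "price"], "0": [1, 2, 3]})
df.filter(pl.col("inputCol").is_in(["price"]))
# keeps only the rows whose inputCol value is "price"
```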
- sequifier.helpers.write_data(data: DataFrame, path: str, write_format: str, **kwargs) None[source]¶
Writes a Polars (or Pandas) DataFrame to a CSV or Parquet file.
This function detects the type of the input DataFrame.
- For Polars DataFrames, it uses .write_csv() or .write_parquet().
- For other DataFrame types (presumably Pandas), it uses .to_csv() or .to_parquet().
Note: The type hint specifies pl.DataFrame, but the implementation includes a fallback path that suggests compatibility with Pandas DataFrames.
- Parameters:
data – The Polars (or Pandas) DataFrame to write.
path – The destination file path.
write_format – The format to write. Supported formats are “csv” and “parquet”.
**kwargs – Additional keyword arguments passed to the underlying write function (e.g., write_csv for Polars, to_csv for Pandas).
- Returns:
None.
- Raises:
ValueError – If write_format is not “csv” or “parquet”.