Anonymization API
This module provides tools to anonymize sensitive data before using it in machine learning workflows.
Main Function
- secureml.anonymization.anonymize(data: Any, method: str = 'k-anonymity', k: int = 5, sensitive_columns: List[str] | None = None, columns: List[str] | None = None, **kwargs: Any) → Any
Anonymize a dataset using the specified method.
Supports multiple input formats, including:
- pandas DataFrame
- list of dictionaries (e.g., [{'col1': 1}, {'col1': 2}])
- dictionary of lists (e.g., {'col1': [1, 2]})
- NumPy array (requires the 'columns' argument)
- file path to a .csv or .json file
- Polars DataFrame (if installed)
- PyArrow Table (if installed)
- Args:
data: The dataset to anonymize.
method: Anonymization method to use. Options: 'k-anonymity', 'pseudonymization', 'data-masking', 'generalization'.
k: Parameter for k-anonymity (minimum size of equivalence classes).
sensitive_columns: List of column names containing sensitive information.
columns: List of column names, required for inputs such as NumPy arrays that do not have inherent column labels.
**kwargs: Additional parameters for specific anonymization methods.
- Returns:
The anonymized dataset in the same format as the input. For file path inputs, returns a pandas DataFrame.
- Raises:
ValueError: If an unsupported anonymization method is specified or required arguments are missing.
TypeError: If the input data format is not supported.
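The accepted input shapes describe the same tabular data in different layouts. As a rough illustration only (this is not SecureML's conversion code), the sketch below normalizes a list of dictionaries, a dictionary of lists, and unlabeled rows into one column-oriented form. The helper name `to_columns` is hypothetical, and its error behavior is an assumption that merely mirrors the Raises section above.

```python
from typing import Any, Dict, List, Optional


def to_columns(data: Any, columns: Optional[List[str]] = None) -> Dict[str, List[Any]]:
    """Hypothetical helper: normalize tabular input to {column: [values]}."""
    if isinstance(data, dict):
        # Dictionary of lists: already column-oriented.
        return {name: list(values) for name, values in data.items()}
    if isinstance(data, list) and data and all(isinstance(r, dict) for r in data):
        # List of dictionaries: pivot rows into columns.
        names = list(data[0].keys())
        return {name: [row.get(name) for row in data] for name in names}
    if isinstance(data, list) and all(isinstance(r, (list, tuple)) for r in data):
        # Unlabeled rows (array-like): column names must be supplied.
        if columns is None:
            raise ValueError("'columns' is required for unlabeled rows")
        return {name: [row[i] for row in data] for i, name in enumerate(columns)}
    raise TypeError(f"Unsupported input type: {type(data).__name__}")


print(to_columns([{"col1": 1}, {"col1": 2}]))  # {'col1': [1, 2]}
```

Unlabeled rows are the only layout here that needs the extra `columns` argument, which matches the NumPy note in the list of supported formats.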
Anonymization Techniques
K-Anonymity
K-anonymity ensures that each combination of quasi-identifier values appears at least k times in the dataset, making it difficult to re-identify individuals.
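The definition can be checked directly by counting how often each quasi-identifier combination occurs. The following is a minimal sketch using only the standard library; `is_k_anonymous` and the sample rows are illustrative and are not part of the SecureML API.

```python
from collections import Counter
from typing import Any, Dict, List


def is_k_anonymous(rows: List[Dict[str, Any]], quasi_identifiers: List[str], k: int) -> bool:
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


rows = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "021**", "diagnosis": "flu"},
]
print(is_k_anonymous(rows, ["age", "zip"], k=2))      # False: the 40-49 group has one row
print(is_k_anonymous(rows[:2], ["age", "zip"], k=2))  # True
```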
Internal Implementation
- secureml.anonymization._apply_k_anonymity(data: DataFrame, sensitive_columns: List[str], k: int = 5, quasi_identifier_columns: List[str] | None = None, categorical_generalization_levels: Dict[str, Dict[str, str]] | None = None, numeric_generalization_strategy: str = 'equal_width', max_generalization_iterations: int = 5, suppression_threshold: float = 0.05, **kwargs: Any) → DataFrame
Apply k-anonymity to the dataset using generalization and suppression.
- Args:
data: The dataset to anonymize.
sensitive_columns: Columns containing sensitive information that should be preserved.
k: The k value for k-anonymity (minimum size of equivalence classes).
quasi_identifier_columns: Columns that could be used for re-identification (if None, all non-sensitive columns are considered quasi-identifiers).
categorical_generalization_levels: Dictionary mapping categorical columns to their generalization hierarchies. Format: {column: {value: generalized_value}}.
numeric_generalization_strategy: Strategy for generalizing numeric columns. Options: "equal_width", "equal_frequency", "mdlp".
max_generalization_iterations: Maximum number of generalization iterations.
suppression_threshold: Maximum fraction of records that can be suppressed.
**kwargs: Additional parameters.
- Returns:
The k-anonymized dataset
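To make the interplay of generalization and suppression concrete, here is a simplified sketch for a single numeric quasi-identifier. It is an assumption-laden illustration, not the library's algorithm: the function names, the fixed list of range widths (standing in for generalization iterations), and the single "age" column are all hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence


def generalize_age(age: int, width: int) -> str:
    """Replace an exact age with a fixed-width range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def k_anonymize_ages(
    rows: List[Dict[str, Any]],
    k: int,
    widths: Sequence[int] = (5, 10, 20),
    suppression_threshold: float = 0.05,
) -> List[Dict[str, Any]]:
    """Widen age ranges until undersized groups fit within the suppression budget."""
    for width in widths:
        generalized = [{**row, "age": generalize_age(row["age"], width)} for row in rows]
        counts = Counter(row["age"] for row in generalized)
        # Suppress (drop) rows whose group is still smaller than k.
        kept = [row for row in generalized if counts[row["age"]] >= k]
        suppressed = len(rows) - len(kept)
        if suppressed <= suppression_threshold * len(rows):
            return kept
    raise ValueError("could not reach k-anonymity within the suppression threshold")


people = [{"age": a, "diagnosis": "flu"} for a in (31, 33, 36, 38)]
print(k_anonymize_ages(people, k=3, suppression_threshold=0.0))
```

With k=3 and no suppression allowed, the 5-year ranges produce two groups of two, so the sketch falls back to 10-year ranges, where all four rows share "30-39".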
Pseudonymization
Pseudonymization replaces identifying data with artificial identifiers while preserving data characteristics.
Internal Implementation
- secureml.anonymization._apply_pseudonymization(data: DataFrame, sensitive_columns: List[str], strategy: str = 'hash', preserve_format: bool = True, salt: str | None = None, **kwargs: Any) → DataFrame
Apply pseudonymization to the dataset using modern techniques.
This implementation provides multiple pseudonymization strategies while preserving data characteristics and format where appropriate. It supports:
- Hash-based pseudonymization (default)
- Format-preserving encryption
- Deterministic pseudonymization with salt
- Custom mapping with format preservation
- Args:
data: The dataset to anonymize.
sensitive_columns: Columns containing sensitive information.
strategy: Pseudonymization strategy to use. Options:
"hash": Hash-based pseudonymization (default)
"fpe": Format-preserving encryption
"deterministic": Deterministic pseudonymization with salt
"custom": Custom mapping with format preservation
preserve_format: Whether to preserve the format of the original values.
salt: Optional salt for deterministic pseudonymization.
**kwargs: Additional parameters for specific strategies.
- Returns:
The pseudonymized dataset
- Raises:
ValueError: If an unsupported strategy is specified or required parameters are missing.
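As an illustration of salted, deterministic pseudonymization, the sketch below derives a stable pseudonym with HMAC-SHA256 from the standard library. Whether SecureML's "hash" or "deterministic" strategies work this way is not stated here, so treat the function `pseudonymize` and its `length` parameter as assumptions made for the example.

```python
import hashlib
import hmac


def pseudonymize(value: str, salt: str, length: int = 16) -> str:
    """Map a value to a stable pseudonym using HMAC-SHA256 keyed with the salt."""
    digest = hmac.new(salt.encode("utf-8"), value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


emails = ["alice@example.com", "bob@example.com", "alice@example.com"]
pseudonyms = [pseudonymize(e, salt="project-secret") for e in emails]
print(pseudonyms[0] == pseudonyms[2])  # True: same input and salt give the same pseudonym
print(pseudonyms[0] == pseudonyms[1])  # False: different inputs give different pseudonyms
```

Because the mapping is deterministic for a given salt, joins across tables still work after pseudonymization, while changing the salt yields an unrelated set of pseudonyms.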
Data Masking
Data masking hides specific parts of data while preserving its format and potentially statistical properties.
Internal Implementation
- secureml.anonymization._apply_data_masking(data: DataFrame, sensitive_columns: List[str], masking_rules: Dict[str, Dict[str, Any]] | None = None, default_strategy: str = 'character', preserve_format: bool = True, preserve_statistics: bool = False, **kwargs: Any) → DataFrame
Apply data masking to the dataset with advanced configuration options.
This implementation provides multiple masking strategies customized for different data types and use cases. It supports format preservation, statistical property preservation, and column-specific masking rules.
- Args:
data: The dataset to anonymize.
sensitive_columns: Columns containing sensitive information.
masking_rules: Dictionary mapping column names to masking configurations. Format: {column_name: {strategy: str, **strategy_params}}.
default_strategy: Default masking strategy when not specified in masking_rules. Options: "character", "fixed", "regex", "random", "redact", "nullify".
preserve_format: Whether to preserve the general format of the original values.
preserve_statistics: For numeric columns, whether to preserve statistical properties like mean and range.
**kwargs: Additional parameters for specific strategies.
- Returns:
The masked dataset
- Example:
masking_rules = {
    "email": {"strategy": "regex", "pattern": r"(.)(.*)(@.*)"},
    "credit_card": {"strategy": "character", "show_first": 4, "show_last": 4},
    "phone": {"strategy": "fixed", "mask_char": "X", "format": "XXX-XXX-XXXX"},
    "income": {"strategy": "random", "preserve_statistics": True}
}
The module implements multiple masking strategies:
- Character masking (show/hide specific portions)
- Fixed masking (with a predefined format)
- Regex-based masking (pattern replacement)
- Random masking (with statistical preservation)
- Redaction (complete replacement)
- Nullification (data removal)
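Two of these strategies, character masking and regex-based masking, are easy to sketch with the standard library. The functions below are hypothetical stand-ins written for this example. They borrow the `show_first`, `show_last`, and `mask_char` names and the email pattern from the example rules, but how SecureML interprets those parameters internally is an assumption.

```python
import re


def mask_characters(value: str, show_first: int = 0, show_last: int = 0, mask_char: str = "*") -> str:
    """Mask alphanumeric characters except the first/last few; keep separators."""
    positions = [i for i, ch in enumerate(value) if ch.isalnum()]
    visible = set(positions[:show_first])
    if show_last:
        visible.update(positions[-show_last:])
    return "".join(
        ch if (i in visible or not ch.isalnum()) else mask_char
        for i, ch in enumerate(value)
    )


def mask_email(value: str, mask_char: str = "*") -> str:
    """Keep the first character and the domain; mask the rest of the local part."""
    return re.sub(
        r"(.)(.*)(@.*)",
        lambda m: m.group(1) + mask_char * len(m.group(2)) + m.group(3),
        value,
    )


print(mask_characters("4111-1111-1111-1234", show_first=4, show_last=4))  # 4111-****-****-1234
print(mask_email("alice@example.com"))  # a****@example.com
```

Leaving separators such as hyphens untouched is one simple way to preserve the general format of a value while hiding its content.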
Generalization
Generalization reduces the granularity of data to protect privacy while maintaining analytical utility.
Internal Implementation
- secureml.anonymization._apply_generalization(data: DataFrame, sensitive_columns: List[str], generalization_rules: Dict[str, Dict[str, Any]] | None = None, default_method: str = 'range', hierarchical_taxonomies: Dict[str, Dict[str, Any]] | None = None, preserve_statistics: bool = True, **kwargs: Any) → DataFrame
Apply data generalization with configurable approaches for different column types.
Generalization reduces the granularity of data to protect privacy while maintaining analytical utility. This implementation provides multiple methods tailored to different data types and use cases.
- Args:
data: The dataset to generalize.
sensitive_columns: Columns containing sensitive information.
generalization_rules: Dictionary mapping column names to generalization configs. Format: {column_name: {method: str, **method_params}}.
default_method: Default generalization method if not specified in rules. Options: "range", "hierarchy", "binning", "topk", "rounding", "concept".
hierarchical_taxonomies: Dictionary of pre-defined hierarchical taxonomies for categorical data (e.g., location: {city → region → country}).
preserve_statistics: For numeric columns, whether to preserve statistical properties for analytical purposes.
**kwargs: Additional parameters for specific generalization methods.
- Returns:
The generalized dataset
- Example:
generalization_rules = {
    "age": {"method": "range", "range_size": 10, "min_bound": 0},
    "zipcode": {"method": "topk", "k": 3, "other_value": "Other"},
    "diagnosis": {"method": "hierarchy", "taxonomy_name": "icd10"},
    "income": {"method": "binning", "num_bins": 5, "strategy": "quantile"}
}
The module implements multiple generalization methods:
- Range generalization (for numerical data)
- Hierarchical generalization (using taxonomies)
- Binning (equal width, equal frequency, etc.)
- Top-k generalization (keep only the k most frequent values)
- Rounding (to a specified base)
- Concept hierarchies (map values to higher-level concepts)
- Date generalization (year, month, quarter, etc.)
- String generalization (prefix, suffix, etc.)
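Range and top-k generalization can be sketched in a few lines. The helpers below are illustrative only. They reuse the parameter names `range_size`, `min_bound`, `k`, and `other_value` from the example rules, but the exact semantics SecureML gives those parameters (for instance, how range labels are formatted) are assumptions.

```python
from collections import Counter
from typing import List


def to_range(value: int, range_size: int = 10, min_bound: int = 0) -> str:
    """Replace a number with the fixed-size range that contains it."""
    low = min_bound + ((value - min_bound) // range_size) * range_size
    return f"{low}-{low + range_size - 1}"


def top_k(values: List[str], k: int, other_value: str = "Other") -> List[str]:
    """Keep the k most frequent values and replace everything else."""
    keep = {v for v, _ in Counter(values).most_common(k)}
    return [v if v in keep else other_value for v in values]


print(to_range(37))  # 30-39
print(top_k(["10001", "10001", "94105", "94105", "60601"], k=2))
# ['10001', '10001', '94105', '94105', 'Other']
```

Both methods trade detail for privacy: rare values are the easiest to re-identify, so folding them into a range or an "Other" bucket removes the most identifying information first.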
Utility Functions
- secureml.anonymization._identify_sensitive_columns(data: DataFrame) → List[str]
Automatically identify sensitive columns in a dataset using pattern matching and content analysis.
This implementation categorizes sensitive data according to modern privacy frameworks like GDPR, CCPA, HIPAA, and ISO/IEC 27701, considering both direct and quasi-identifiers.
- Args:
data: The dataset to analyze
- Returns:
A list of column names that appear to contain sensitive information
This function automatically identifies columns that likely contain sensitive information based on:
- Column names matching patterns associated with sensitive data
- Content analysis (detecting emails, phone numbers, names, etc.)
- Privacy framework categorizations (GDPR, CCPA, HIPAA, ISO/IEC 27701)
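The first two criteria, name patterns and content analysis, can be approximated with regular expressions. The sketch below is a deliberately small illustration. The keyword list, the email and phone patterns, and the `min_match_ratio` threshold are all assumptions chosen for the example and do not reflect the patterns SecureML actually uses.

```python
import re
from typing import Any, Dict, List

# Assumed keyword list for column names; real detectors use far larger vocabularies.
NAME_PATTERN = re.compile(r"(name|email|phone|ssn|address|birth|dob|passport)", re.IGNORECASE)
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_PATTERN = re.compile(r"^\+?[\d\s().-]{7,}$")


def find_sensitive_columns(table: Dict[str, List[Any]], min_match_ratio: float = 0.8) -> List[str]:
    """Flag columns whose name or content looks sensitive."""
    flagged = []
    for column, values in table.items():
        if NAME_PATTERN.search(column):
            flagged.append(column)  # Column name matches a sensitive keyword.
            continue
        texts = [str(v) for v in values if v is not None]
        if not texts:
            continue
        hits = sum(1 for t in texts if EMAIL_PATTERN.match(t) or PHONE_PATTERN.match(t))
        if hits / len(texts) >= min_match_ratio:
            flagged.append(column)  # Most values look like emails or phone numbers.
    return flagged


table = {
    "full_name": ["Ada Lovelace", "Alan Turing"],
    "contact": ["ada@example.com", "alan@example.org"],
    "score": [1, 2],
}
print(find_sensitive_columns(table))  # ['full_name', 'contact']
```

Heuristics like these produce false positives (a column called "filename" matches the keyword "name") and false negatives, so the result is best treated as a starting point for review.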