Synthetic Data API
This module provides tools to create synthetic datasets that mimic the statistical properties of real data without containing sensitive information, enabling privacy-preserving machine learning development and testing.
Main Functions
- secureml.synthetic.generate_synthetic_data(template: DataFrame | Dict[str, Any], num_samples: int = 100, method: str = 'simple', sensitive_columns: List[str] | None = None, seed: int | None = None, sensitivity_detection: Dict[str, Any] | None = None, **kwargs: Any) -> DataFrame
Generate synthetic data based on a template dataset.
- Args:
template: Original dataset to mimic (DataFrame) or schema specification (Dict)
num_samples: Number of synthetic samples to generate
method: Generation method: 'simple', 'statistical', 'sdv-copula', 'sdv-ctgan', 'sdv-tvae', 'gan', or 'copula'
sensitive_columns: Columns that contain sensitive data and need special handling
seed: Random seed for reproducibility
sensitivity_detection: Configuration for sensitive column detection:
    sample_size: Number of rows to sample (default: 100)
    confidence_threshold: Minimum confidence score (default: 0.5)
    auto_detect: Whether to auto-detect sensitive columns if none provided (default: True)
**kwargs: Additional parameters for specific generation methods
- Returns:
DataFrame containing synthetic data
- Raises:
ValueError: If an unsupported generation method is specified
ImportError: If an SDV method is requested but SDV is not installed
The example below calls the main function to generate synthetic data from a template dataset:
from secureml.synthetic import generate_synthetic_data
import pandas as pd
# Load a real dataset as a template
template_data = pd.read_csv("patient_data.csv")
# Generate 1000 synthetic samples using the statistical method
synthetic_data = generate_synthetic_data(
    template=template_data,
    num_samples=1000,
    method="statistical",
    sensitive_columns=["name", "email", "ssn", "address"],
    seed=42,  # For reproducibility
)
# Save the synthetic dataset
synthetic_data.to_csv("synthetic_patient_data.csv", index=False)
Supported Generation Methods
The module supports several methods for generating synthetic data, each with different characteristics and trade-offs:
Simple Method (method="simple")
A basic approach that preserves individual column distributions but not complex relationships between variables.
# Generate simple synthetic data
simple_synthetic = generate_synthetic_data(
    template=data,
    num_samples=500,
    method="simple"
)
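To make "preserves individual column distributions but not complex relationships" concrete, the sketch below resamples each column independently using only the standard library. It illustrates the general idea and is not secureml's implementation; `simple_resample` and the toy template are hypothetical.

```python
import random

def simple_resample(columns, num_samples, seed=None):
    """Draw each column independently from its observed values (hypothetical helper)."""
    rng = random.Random(seed)
    return {
        name: [rng.choice(values) for _ in range(num_samples)]
        for name, values in columns.items()
    }

# Toy template in which income is always age * 1000.
template = {
    "age": [25, 32, 47, 51, 63],
    "income": [25000, 32000, 47000, 51000, 63000],
}

synthetic = simple_resample(template, num_samples=200, seed=42)

# Every synthetic value comes from the matching template column...
values_ok = all(set(synthetic[c]) <= set(template[c]) for c in template)
# ...but the row-wise link between age and income is generally lost.
rows_linked = sum(
    1 for a, i in zip(synthetic["age"], synthetic["income"]) if i == a * 1000
)
print(values_ok, rows_linked, "of", len(synthetic["age"]), "rows keep the age/income link")
```

Because the columns are drawn separately, only a fraction of the synthetic rows keep the age/income relationship, which is the trade-off this method accepts.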
Statistical Method (method="statistical")
A more sophisticated approach that preserves correlations and statistical properties.
# Generate synthetic data preserving statistical properties
statistical_synthetic = generate_synthetic_data(
    template=data,
    num_samples=500,
    method="statistical",
    handle_skewness=True,
    preserve_outliers=False
)
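As a rough illustration of what preserving correlations can involve, the sketch below fits a bivariate normal to two numeric columns and samples correlated pairs from it. This is one common technique, assumed here for illustration; the source does not state how secureml's statistical method works, and `correlated_pairs` is a hypothetical helper that ignores options such as `handle_skewness`.

```python
import math
import random

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlated_pairs(xs, ys, num_samples, seed=None):
    """Sample (x, y) pairs from a bivariate normal fitted to the template."""
    rng = random.Random(seed)
    mx, my, sx, sy = mean(xs), mean(ys), std(xs), std(ys)
    rho = pearson(xs, ys)
    out_x, out_y = [], []
    for _ in range(num_samples):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        out_x.append(mx + sx * z1)
        # Mixing z1 into y gives the pair the template's correlation rho.
        out_y.append(my + sy * (rho * z1 + math.sqrt(max(0.0, 1 - rho ** 2)) * z2))
    return out_x, out_y

ages = [23, 31, 38, 45, 52, 60, 67]
incomes = [21000, 35000, 39000, 52000, 50000, 66000, 61000]

syn_ages, syn_incomes = correlated_pairs(ages, incomes, num_samples=5000, seed=42)
print("template r:", round(pearson(ages, incomes), 2))
print("synthetic r:", round(pearson(syn_ages, syn_incomes), 2))
```

The two printed correlations should be close, in contrast to independent per-column sampling, where the synthetic correlation would be near zero.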
SDV Methods (method="sdv-copula", method="sdv-ctgan", method="sdv-tvae")
These methods use the Synthetic Data Vault (SDV) library for advanced generative models:
# Generate synthetic data using a Gaussian Copula model
copula_synthetic = generate_synthetic_data(
    template=data,
    num_samples=500,
    method="sdv-copula",
    constraints=[
        {"type": "unique", "columns": ["id"]}
    ]
)
# Generate synthetic data using a GAN-based model
ctgan_synthetic = generate_synthetic_data(
    template=data,
    num_samples=500,
    method="sdv-ctgan",
    epochs=100
)
Direct Methods (method="gan", method="copula")
Direct implementations of generative adversarial networks and copulas:
# Generate synthetic data using a GAN model
gan_synthetic = generate_synthetic_data(
    template=data,
    num_samples=500,
    method="gan",
    epochs=300,
    batch_size=32
)
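For the copula option, the underlying idea can be sketched with a two-column Gaussian copula: convert each column to normal scores by rank, measure the correlation of those scores, sample correlated normals, and map them back through each column's empirical quantiles. This is a minimal standard-library sketch of the general technique, not the code behind `method="copula"`; all helper names and the toy data are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def normal_scores(values):
    """Map each value to a standard-normal score based on its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        scores[idx] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores

def correlation(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(
        sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    )

def empirical_quantile(sorted_values, u):
    """Look up the observed value at quantile u in [0, 1]."""
    return sorted_values[min(int(u * len(sorted_values)), len(sorted_values) - 1)]

def copula_pairs(xs, ys, num_samples, seed=None):
    """Sample pairs that keep each column's observed values and their dependence."""
    rng = random.Random(seed)
    rho = correlation(normal_scores(xs), normal_scores(ys))
    sx, sy = sorted(xs), sorted(ys)
    pairs = []
    for _ in range(num_samples):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(max(0.0, 1 - rho ** 2)) * rng.gauss(0, 1)
        pairs.append((
            empirical_quantile(sx, STD_NORMAL.cdf(z1)),
            empirical_quantile(sy, STD_NORMAL.cdf(z2)),
        ))
    return pairs

# Skewed toy columns: length of stay (days) and cost.
stay_days = [1, 2, 2.5, 3, 4, 6, 9, 15, 30]
cost = [900, 1500, 1400, 2500, 2200, 4100, 8000, 7000, 26000]

pairs = copula_pairs(stay_days, cost, num_samples=2000, seed=42)
print(pairs[:5])
```

Unlike a plain multivariate normal, this keeps the skewed shape of each column, because the output values are taken from the observed quantiles rather than from a fitted bell curve.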
Sensitive Data Handling
The module provides automatic detection and special handling for sensitive columns:
# Automatic detection of sensitive columns
synthetic_data = generate_synthetic_data(
    template=data,
    num_samples=1000,
    method="statistical",
    sensitivity_detection={
        "auto_detect": True,
        "confidence_threshold": 0.7,
        "sample_size": 100
    }
)
# Explicitly specify sensitive columns
synthetic_data = generate_synthetic_data(
    template=data,
    num_samples=1000,
    method="statistical",
    sensitive_columns=["email", "phone", "ssn", "medical_record"]
)
Creating Synthetic Data from Schema
You can also generate synthetic data from a schema definition without a template dataset:
# Define a schema
schema = {
    "columns": {
        "age": "int",
        "income": "float",
        "education": "category",
        "marital_status": "category"
    }
}
# Generate synthetic data from schema
synthetic_from_schema = generate_synthetic_data(
    template=schema,
    num_samples=1000,
    method="statistical"
)
Using Constraints with SDV Methods
When using SDV-based methods, you can apply constraints to the synthetic data:
# Generate synthetic data with constraints
synthetic_with_constraints = generate_synthetic_data(
    template=data,
    num_samples=500,
    method="sdv-copula",
    constraints=[
        {"type": "unique", "columns": ["id"]},
        {"type": "fixed_combinations", "column_names": ["city", "state"]},
        {"type": "inequality", "low_column": "start_date", "high_column": "end_date"}
    ]
)
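After generation it can be useful to confirm that the output actually satisfies the rules that were requested. The sketch below checks the three constraint types from the example against a list of row dictionaries. `check_constraints` is a hypothetical helper, not part of secureml or SDV, and it assumes that "inequality" means low <= high and that "fixed_combinations" means only combinations seen in the template are allowed.

```python
from datetime import date

def check_constraints(rows, template_rows, constraints):
    """Return descriptions of the constraints that the rows violate."""
    violations = []
    for c in constraints:
        if c["type"] == "unique":
            keys = [tuple(r[col] for col in c["columns"]) for r in rows]
            if len(keys) != len(set(keys)):
                violations.append(f"unique{c['columns']}")
        elif c["type"] == "fixed_combinations":
            cols = c["column_names"]
            allowed = {tuple(r[col] for col in cols) for r in template_rows}
            if any(tuple(r[col] for col in cols) not in allowed for r in rows):
                violations.append(f"fixed_combinations{cols}")
        elif c["type"] == "inequality":
            if any(r[c["low_column"]] > r[c["high_column"]] for r in rows):
                violations.append(f"inequality({c['low_column']} <= {c['high_column']})")
    return violations

constraints = [
    {"type": "unique", "columns": ["id"]},
    {"type": "fixed_combinations", "column_names": ["city", "state"]},
    {"type": "inequality", "low_column": "start_date", "high_column": "end_date"},
]

template_rows = [
    {"id": 1, "city": "Austin", "state": "TX",
     "start_date": date(2024, 1, 1), "end_date": date(2024, 2, 1)},
    {"id": 2, "city": "Boston", "state": "MA",
     "start_date": date(2024, 3, 1), "end_date": date(2024, 3, 9)},
]

# Deliberately broken rows: duplicate id, unseen city/state pair, reversed dates.
synthetic_rows = [
    {"id": 10, "city": "Austin", "state": "TX",
     "start_date": date(2024, 3, 1), "end_date": date(2024, 3, 15)},
    {"id": 10, "city": "Boston", "state": "TX",
     "start_date": date(2024, 5, 1), "end_date": date(2024, 4, 1)},
]

violations = check_constraints(synthetic_rows, template_rows, constraints)
print(violations)
```

An empty list means every listed rule held for the rows that were checked.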
Helper Functions
The module also provides several helper functions that are used internally:
- _identify_sensitive_columns: Automatically identifies columns containing sensitive data
- _generate_simple_synthetic: Implements the simple generation method
- _generate_statistical_synthetic: Implements the statistical generation method
- _generate_sdv_synthetic: Implements SDV-based generation methods
- _generate_gan_synthetic: Implements GAN-based generation
- _generate_copula_synthetic: Implements copula-based generation
Best Practices
- Start Simple: Begin with simpler methods such as "simple" or "statistical" before trying more complex models
- Evaluate Quality: Compare synthetic data distributions with the original data
- Handle Sensitive Data: Always specify sensitive columns or enable auto-detection
- Set Seed: Use the seed parameter for reproducible results
- Balance Privacy and Utility: More complex methods may preserve utility better, but a model that fits the template closely can also reproduce details of real records, so review the output for privacy leakage
- Constraints Matter: Use constraints with SDV methods to ensure business rules are preserved
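One simple way to act on the "Evaluate Quality" advice is to compare summary statistics and a two-sample Kolmogorov-Smirnov (KS) statistic for each numeric column. The sketch below uses only the standard library and simulated columns in place of real and synthetic data; `ks_statistic` is a hand-written helper for illustration, not a secureml function.

```python
import random
import statistics

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare CDFs.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

rng = random.Random(42)
original = [rng.gauss(50, 10) for _ in range(1000)]
close_synthetic = [rng.gauss(50, 10) for _ in range(1000)]    # same distribution
shifted_synthetic = [rng.gauss(65, 10) for _ in range(1000)]  # mean is off by 15

for label, sample in [("close", close_synthetic), ("shifted", shifted_synthetic)]:
    print(
        label,
        "mean diff:", round(statistics.mean(sample) - statistics.mean(original), 2),
        "std diff:", round(statistics.stdev(sample) - statistics.stdev(original), 2),
        "KS:", round(ks_statistic(original, sample), 3),
    )
```

A KS value near 0 suggests the synthetic column follows the original distribution closely, while a value approaching 1 indicates the two barely overlap.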