Synthetic Data
Synthetic data generation creates artificial data that statistically resembles real data while protecting privacy. SecureML provides several techniques to generate high-quality synthetic data that maintains the utility of the original dataset without exposing sensitive information.
Core Concepts
Statistical Preservation: Maintaining the statistical properties of the original data, such as distributions, correlations, and dependencies.
Privacy Guarantees: Ensuring that the synthetic data doesn’t leak information about specific individuals in the original dataset.
Utility Preservation: Ensuring models trained on synthetic data perform similarly to those trained on real data.
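One way to make statistical preservation concrete is to compare the correlation structure of the original and synthetic datasets. The sketch below is not part of SecureML: `max_correlation_gap` is a hypothetical helper, and it assumes both inputs are pandas DataFrames that share numeric columns.

```python
import numpy as np
import pandas as pd


def max_correlation_gap(original: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Largest absolute difference between the two Pearson correlation matrices."""
    numeric_cols = [
        col
        for col in original.select_dtypes(include="number").columns
        if col in synthetic.columns
    ]
    gap = (original[numeric_cols].corr() - synthetic[numeric_cols].corr()).abs()
    # nanmax skips pairs whose correlation is undefined (e.g. constant columns).
    return float(np.nanmax(gap.to_numpy()))


# Toy data: the "synthetic" frame reverses the x-y relationship.
original = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})
synthetic = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [8.0, 6.0, 4.0, 2.0]})
print(max_correlation_gap(original, synthetic))  # close to 2.0, the worst case
```

A gap near 0 suggests the pairwise linear relationships survived generation. It says nothing about privacy, which has to be checked separately.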
Basic Usage
Generating Synthetic Data
The main function for generating synthetic data is generate_synthetic_data:
from secureml.synthetic import generate_synthetic_data

# Generate synthetic data
synthetic_df = generate_synthetic_data(
    template=original_df,
    num_samples=1000,
    method="statistical",  # Options: "simple", "statistical", "sdv-copula", "sdv-ctgan", "sdv-tvae", "gan", "copula"
    sensitive_columns=["name", "email", "ssn", "phone"],
    seed=42
)
Automatic Detection of Sensitive Columns
SecureML can automatically detect sensitive columns in your data:
from secureml.synthetic import generate_synthetic_data

# Generate synthetic data with automatic sensitive column detection
synthetic_df = generate_synthetic_data(
    template=original_df,
    num_samples=1000,
    method="statistical",
    sensitivity_detection={
        "auto_detect": True,
        "confidence_threshold": 0.7,
        "sample_size": 100
    }
)
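SecureML's detection logic is not described here, so the following is only an illustration of how a threshold-based detector of this general kind could work. The patterns and the `detect_sensitive_columns` helper are hypothetical. The parameter names mirror `confidence_threshold` and `sample_size` above, but treating them as "fraction of sampled values that match a pattern" and "number of values inspected per column" is an assumption.

```python
import re

import pandas as pd

# Hypothetical patterns for illustration; not SecureML's detection rules.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def detect_sensitive_columns(df, confidence_threshold=0.7, sample_size=100):
    """Map each column to the first pattern matched by enough sampled values."""
    detected = {}
    for col in df.columns:
        # Inspect at most `sample_size` non-null values, compared as strings.
        values = df[col].dropna().astype(str).head(sample_size)
        if values.empty:
            continue
        for label, pattern in PATTERNS.items():
            score = values.map(lambda v: bool(pattern.match(v))).mean()
            if score >= confidence_threshold:
                detected[col] = label
                break
    return detected


df = pd.DataFrame(
    {
        "contact": ["a@example.com", "b@example.org", "unknown"],
        "national_id": ["123-45-6789", "987-65-4321", "111-22-3333"],
        "age": [34, 51, 27],
    }
)
print(detect_sensitive_columns(df, confidence_threshold=0.6))
```

Under this reading, lowering the threshold flags columns that contain some non-matching values (such as `contact` above), at the cost of more false positives.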
Supported Methods
SecureML supports several synthetic data generation methods:
Simple Random Sampling
Basic method suitable for quick prototyping:
# Generate synthetic data using simple method
synthetic_df = generate_synthetic_data(
    template=original_df,
    num_samples=1000,
    method="simple",
    sensitive_columns=["name", "email", "ssn"]
)
Statistical Method
More sophisticated method that preserves statistical relationships:
# Generate synthetic data using statistical method
synthetic_df = generate_synthetic_data(
    template=original_df,
    num_samples=1000,
    method="statistical",
    sensitive_columns=["name", "email", "ssn"],
    preserve_dtypes=True,
    preserve_outliers=True,
    categorical_threshold=20,
    handle_skewness=True,
    seed=42
)
SDV Integration Methods
Integration with the Synthetic Data Vault (SDV) library for advanced generation (requires SDV to be installed):
# Generate synthetic data using SDV's Gaussian Copula
synthetic_df = generate_synthetic_data(
    template=original_df,
    num_samples=1000,
    method="sdv-copula",
    sensitive_columns=["name", "email", "ssn"],
    anonymize_fields=True
)

# Generate synthetic data using SDV's CTGAN
synthetic_df = generate_synthetic_data(
    template=original_df,
    num_samples=1000,
    method="sdv-ctgan",
    sensitive_columns=["name", "email", "ssn"],
    anonymize_fields=True,
    epochs=300,
    batch_size=500
)

# Generate synthetic data using SDV's TVAE
synthetic_df = generate_synthetic_data(
    template=original_df,
    num_samples=1000,
    method="sdv-tvae",
    sensitive_columns=["name", "email", "ssn"],
    anonymize_fields=True
)
GAN-based Method
Generative Adversarial Network approach (without requiring SDV):
# Generate synthetic data using GAN method
synthetic_df = generate_synthetic_data(
    template=original_df,
    num_samples=1000,
    method="gan",
    sensitive_columns=["name", "email", "ssn"],
    epochs=300,
    batch_size=32,
    generator_dim=[128, 128],
    discriminator_dim=[128, 128],
    learning_rate=0.001,
    noise_dim=100,
    preserve_dtypes=True
)
Copula-based Method
Copula method for capturing variable dependencies:
# Generate synthetic data using copula method
synthetic_df = generate_synthetic_data(
    template=original_df,
    num_samples=1000,
    method="copula",
    sensitive_columns=["name", "email", "ssn"],
    copula_type="gaussian",
    fit_method="ml",
    preserve_dtypes=True,
    handle_missing="mean",
    categorical_threshold=20,
    handle_skewness=True,
    seed=42
)
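To show what a Gaussian copula does in general terms, here is a minimal sketch using NumPy and pandas. It is not SecureML's implementation. `gaussian_copula_sample` is a hypothetical helper that handles only numeric columns (at least two, none constant) and uses empirical marginals: it converts each column to normal scores, estimates their correlation, samples correlated normals, and maps them back through the original quantiles.

```python
from statistics import NormalDist

import numpy as np
import pandas as pd


def gaussian_copula_sample(df, num_samples, seed=None):
    """Sample numeric columns while preserving their rank dependence structure."""
    rng = np.random.default_rng(seed)
    numeric = df.select_dtypes(include="number").dropna()
    n = len(numeric)
    std_normal = NormalDist()

    # 1. Ranks -> uniforms strictly inside (0, 1) -> standard normal scores.
    uniforms = numeric.rank(method="average") / (n + 1)
    scores = uniforms.apply(lambda col: col.map(std_normal.inv_cdf))

    # 2. The correlation of the normal scores is the copula parameter.
    corr = np.corrcoef(scores.to_numpy(), rowvar=False)

    # 3. Draw correlated standard normals and convert them back to uniforms.
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=num_samples)
    u = np.vectorize(std_normal.cdf)(z)

    # 4. Map each uniform through the empirical quantiles of its column.
    data = {
        col: np.quantile(numeric[col].to_numpy(), u[:, i])
        for i, col in enumerate(numeric.columns)
    }
    return pd.DataFrame(data)


# Toy data with a strong linear dependence between x and y.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
original = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(scale=0.1, size=500)})
synthetic = gaussian_copula_sample(original, num_samples=1000, seed=1)
print(original.corr().loc["x", "y"], synthetic.corr().loc["x", "y"])
```

Because values are drawn from empirical quantiles, the samples never leave the observed range of each column. That keeps the sketch simple but would be a privacy concern for outliers in real use.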
Providing Data Schema Instead of Template DataFrame
You can generate synthetic data using a schema definition instead of an actual DataFrame:
# Define a schema
schema = {
    "columns": {
        "age": "int",
        "income": "float",
        "gender": "category",
        "education": "category"
    }
}

# Generate synthetic data from schema
synthetic_df = generate_synthetic_data(
    template=schema,
    num_samples=1000,
    method="statistical"
)
Advanced Usage
SDV Constraints
When using SDV methods, you can specify constraints on the generated data:
# Define constraints for SDV methods
constraints = [
    {"type": "unique", "columns": ["id"]},
    {"type": "fixed_combinations", "column_names": ["state", "city"]},
    {"type": "inequality", "low_column": "min_salary", "high_column": "max_salary"}
]

# Generate synthetic data with constraints
synthetic_df = generate_synthetic_data(
    template=original_df,
    num_samples=1000,
    method="sdv-copula",
    constraints=constraints
)
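It can be worth confirming after generation that the output actually satisfies the constraints. The check below is a sketch, not a SecureML function: `violated_constraints` is a hypothetical helper that reads the same constraint dictionaries shown above, and it assumes "inequality" means low <= high and "fixed_combinations" means only combinations present in the original data are allowed.

```python
import pandas as pd


def violated_constraints(synthetic, original, constraints):
    """Return the types of the constraints that the synthetic frame breaks."""
    violated = []
    for constraint in constraints:
        kind = constraint["type"]
        if kind == "unique":
            broken = synthetic.duplicated(subset=constraint["columns"]).any()
        elif kind == "fixed_combinations":
            cols = constraint["column_names"]
            allowed = set(original[cols].itertuples(index=False, name=None))
            seen = set(synthetic[cols].itertuples(index=False, name=None))
            broken = not seen <= allowed
        elif kind == "inequality":
            low = synthetic[constraint["low_column"]]
            high = synthetic[constraint["high_column"]]
            broken = (low > high).any()
        else:
            raise ValueError(f"Unsupported constraint type: {kind}")
        if broken:
            violated.append(kind)
    return violated


constraints = [
    {"type": "unique", "columns": ["id"]},
    {"type": "fixed_combinations", "column_names": ["state", "city"]},
    {"type": "inequality", "low_column": "min_salary", "high_column": "max_salary"},
]
original = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "state": ["CA", "CA", "NY"],
        "city": ["LA", "SF", "NYC"],
        "min_salary": [10, 20, 30],
        "max_salary": [15, 25, 35],
    }
)
good = pd.DataFrame(
    {"id": [1, 2], "state": ["CA", "NY"], "city": ["SF", "NYC"],
     "min_salary": [10, 20], "max_salary": [12, 30]}
)
bad = pd.DataFrame(
    {"id": [1, 1], "state": ["CA", "NY"], "city": ["NYC", "LA"],
     "min_salary": [10, 50], "max_salary": [12, 30]}
)
print(violated_constraints(good, original, constraints))
print(violated_constraints(bad, original, constraints))
```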
Handling Sensitive Information
The function automatically generates realistic but fake data for sensitive columns:
# Generate synthetic data with sensitive column handling
synthetic_df = generate_synthetic_data(
    template=original_df,
    num_samples=1000,
    method="statistical",
    sensitive_columns=["name", "email", "phone", "ssn", "credit_card"]
)
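A quick sanity check after generation is to confirm that no real sensitive value reappears in the synthetic output. The helper below is a hypothetical sketch, not a SecureML API. It assumes both frames are pandas DataFrames and that exact value overlap is the leak of interest.

```python
import pandas as pd


def leaked_values(original, synthetic, sensitive_columns):
    """Return, per sensitive column, the original values found in the synthetic data."""
    leaks = {}
    for col in sensitive_columns:
        if col not in original.columns or col not in synthetic.columns:
            continue
        overlap = set(original[col].dropna()) & set(synthetic[col].dropna())
        if overlap:
            leaks[col] = overlap
    return leaks


original = pd.DataFrame(
    {"name": ["Ada Park", "Bo Lin"], "email": ["ada@example.com", "bo@example.com"]}
)
synthetic = pd.DataFrame(
    {"name": ["Cy Ross", "Di Vance"], "email": ["cy@example.net", "ada@example.com"]}
)
print(leaked_values(original, synthetic, ["name", "email", "ssn"]))
```

An empty result only rules out verbatim copies. For low-cardinality columns some overlap can occur by chance, so a hit calls for inspection and is not automatically proof of leakage.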
Best Practices
Choose the right method: Select the generation method based on your data characteristics:
- For simple datasets with low complexity: "simple"
- For general-purpose generation with good statistical properties: "statistical"
- For complex tabular data with mixed types: "sdv-ctgan" or "sdv-tvae"
- For data with important correlations: "sdv-copula" or "copula"
Automatic sensitive column detection: When in doubt about which columns are sensitive, use the automatic detection feature.
Seed for reproducibility: Always set a seed when you need reproducible results.
Evaluate your synthetic data: Check that the synthetic data preserves important statistical properties while providing sufficient privacy protection.
Balance privacy and utility: Adjust parameters to find the right balance between privacy protection and synthetic data utility.
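Evaluation can go beyond comparing means and standard deviations. As one option, the sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical CDFs) with NumPy only. `ks_statistic` is a hypothetical helper, not part of SecureML, and it assumes missing values have already been dropped.

```python
import numpy as np


def ks_statistic(a, b) -> float:
    """Two-sample KS statistic: 0.0 for identical samples, 1.0 for disjoint ones."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))


print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

Applied per numeric column, for example `ks_statistic(original_df[col].dropna(), synthetic_df[col].dropna())`, smaller values indicate closer marginal distributions.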
Example Workflow
Complete workflow for generating and checking synthetic data:
import pandas as pd
from secureml.synthetic import generate_synthetic_data

# Load original data
original_df = pd.read_csv("customer_data.csv")

# Generate synthetic data with automatic sensitive column detection
synthetic_df = generate_synthetic_data(
    template=original_df,
    num_samples=len(original_df),
    method="statistical",
    sensitivity_detection={"auto_detect": True, "confidence_threshold": 0.7},
    seed=42,
    preserve_dtypes=True,
    handle_skewness=True
)

# Save synthetic data
synthetic_df.to_csv("synthetic_customer_data.csv", index=False)

# Basic validation - compare the mean and standard deviation of each numeric column
for col in original_df.select_dtypes(include=["number"]).columns:
    print(f"Column: {col}")
    print(f"Original mean: {original_df[col].mean()}, std: {original_df[col].std()}")
    print(f"Synthetic mean: {synthetic_df[col].mean()}, std: {synthetic_df[col].std()}")
    print()
Further Reading
Synthetic Data API - Complete API reference for synthetic data functions
Synthetic Data Generation Examples - More examples of synthetic data generation techniques