Synthetic Data Generation Examples
This section demonstrates how to generate synthetic data that preserves the statistical properties of your original data while ensuring privacy. SecureML provides multiple methods with different trade-offs between simplicity, statistical accuracy, and privacy protection.
Basic Synthetic Data Generation
The simplest way to generate synthetic data is using the simple method:
import pandas as pd
from secureml.synthetic import generate_synthetic_data
# Sample data with sensitive information
data = pd.DataFrame({
'name': ['John Smith', 'Jane Doe', 'Robert Johnson'],
'age': [34, 29, 42],
'gender': ['Male', 'Female', 'Male'],
'email': ['john.smith@example.com', 'jane.doe@example.com', 'robert.j@example.com'],
'income': [65000, 72000, 58000],
'credit_score': [720, 750, 680],
'zipcode': ['12345', '23456', '34567']
})
# Generate synthetic data using simple method
synthetic_data = generate_synthetic_data(
template=data,
num_samples=20,
method="simple",
sensitive_columns=['name', 'email'],
seed=42
)
print("Synthetic data sample:")
print(synthetic_data.head())
The simple method generates values by sampling from observed distributions for non-sensitive columns, while using more sophisticated techniques (like Faker) for sensitive columns.
Statistical Method
For better preservation of statistical relationships, use the statistical method:
# Generate synthetic data with statistical method
statistical_synthetic = generate_synthetic_data(
template=data,
num_samples=20,
method="statistical",
sensitive_columns=['name', 'email'],
preserve_dtypes=True,
preserve_outliers=True,
categorical_threshold=10,
handle_skewness=True,
seed=42
)
# Compare correlations
numeric_cols = ['age', 'income', 'credit_score']
print("Original correlation matrix:")
print(data[numeric_cols].corr())
print("\nSynthetic correlation matrix:")
print(statistical_synthetic[numeric_cols].corr())
The statistical method preserves:
- Individual column distributions
- Correlations between variables
- Data types and value ranges
Automatic Sensitive Column Detection
SecureML can automatically detect columns likely to contain sensitive information:
# Generate synthetic data with automatic sensitive column detection
auto_synthetic = generate_synthetic_data(
template=data,
num_samples=20,
method="statistical",
sensitivity_detection={
"auto_detect": True,
"confidence_threshold": 0.7,
"sample_size": 100
},
seed=42
)
The sensitivity detection looks at both column names and data content patterns to identify personal identifiers, financial information, health data, and other sensitive categories.
Schema-Based Generation
You can generate synthetic data directly from a schema without an existing dataset:
# Define a schema for financial customer data
schema = {
"columns": {
"customer_id": "int",
"age": "int",
"income": "float",
"credit_score": "int",
"education_level": "category",
"employment_status": "category",
"has_mortgage": "bool",
"has_loan": "bool",
"account_balance": "float"
}
}
# Generate synthetic data from schema
schema_synthetic = generate_synthetic_data(
template=schema,
num_samples=100,
method="statistical",
seed=42
)
Advanced Synthetic Methods
SDV Integration Methods
SecureML integrates with the Synthetic Data Vault (SDV) library for more sophisticated generation methods:
# SDV's Gaussian Copula method
try:
sdv_copula_synthetic = generate_synthetic_data(
template=data,
num_samples=100,
method="sdv-copula",
sensitive_columns=['name', 'email'],
anonymize_fields=True,
seed=42
)
except ImportError:
print("SDV package not installed. Install with: pip install sdv")
# SDV's CTGAN method (deep learning approach)
try:
sdv_ctgan_synthetic = generate_synthetic_data(
template=data,
num_samples=100,
method="sdv-ctgan",
sensitive_columns=['name', 'email'],
anonymize_fields=True,
epochs=300,
batch_size=32,
seed=42
)
except ImportError:
print("SDV package not installed. Install with: pip install sdv")
You can also specify constraints on the generated data:
# Define constraints for SDV methods
constraints = [
{"type": "unique", "columns": ["customer_id"]},
{"type": "fixed_combinations", "column_names": ["state", "city"]},
{"type": "inequality", "low_column": "min_salary", "high_column": "max_salary"}
]
# Generate data with constraints
sdv_synthetic = generate_synthetic_data(
template=data,
num_samples=100,
method="sdv-copula",
sensitive_columns=['name', 'email'],
constraints=constraints,
seed=42
)
GAN-based Method
For more complex distributions, use the GAN-based method:
# Generate synthetic data using GAN method
gan_synthetic = generate_synthetic_data(
template=data,
num_samples=100,
method="gan",
sensitive_columns=['name', 'email'],
epochs=300,
batch_size=32,
generator_dim=[128, 128],
discriminator_dim=[128, 128],
learning_rate=0.001,
noise_dim=100,
preserve_dtypes=True,
seed=42
)
Copula-based Method
The copula method captures complex dependencies between variables:
# Generate synthetic data using copula method
copula_synthetic = generate_synthetic_data(
template=data,
num_samples=100,
method="copula",
sensitive_columns=['name', 'email'],
copula_type="gaussian",
fit_method="ml",
preserve_dtypes=True,
handle_missing="mean",
categorical_threshold=10,
handle_skewness=True,
seed=42
)
Comparing Methods
Different synthetic generation methods have different strengths. Here’s a comparison:
import numpy as np
# Number of samples to generate
n_samples = 100
# Generate synthetic data with each method
methods = ["simple", "statistical", "copula"]
synthetic_datasets = {}
for method in methods:
synthetic_datasets[method] = generate_synthetic_data(
template=data,
num_samples=n_samples,
method=method,
sensitive_columns=['name', 'email'],
seed=42
)
# Compare means and standard deviations of numeric columns
numeric_cols = ['age', 'income', 'credit_score']
print(f"{'Column':<15} {'Metric':<10} {'Original':<10}", end="")
for method in methods:
print(f" {method.capitalize():<10}", end="")
print()
for col in numeric_cols:
# Mean comparison
print(f"{col:<15} {'Mean':<10} {data[col].mean():<10.2f}", end="")
for method in methods:
synthetic_mean = synthetic_datasets[method][col].mean()
print(f" {synthetic_mean:<10.2f}", end="")
print()
# Std comparison
print(f"{col:<15} {'Std':<10} {data[col].std():<10.2f}", end="")
for method in methods:
synthetic_std = synthetic_datasets[method][col].std()
print(f" {synthetic_std:<10.2f}", end="")
print()
Evaluating Synthetic Data Quality
You can perform simple evaluations to check synthetic data quality:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Scale the data
scaler = StandardScaler()
original_scaled = scaler.fit_transform(data[numeric_cols])
synthetic_scaled = scaler.transform(synthetic_datasets["statistical"][numeric_cols])
# Apply PCA
pca = PCA(n_components=2)
original_pca = pca.fit_transform(original_scaled)
synthetic_pca = pca.transform(synthetic_scaled)
# Calculate a simple statistical similarity score
mse = 0
for col in numeric_cols:
# Normalized mean difference
mean_diff = (data[col].mean() - synthetic_datasets["statistical"][col].mean()) / data[col].mean()
# Normalized std difference
std_diff = (data[col].std() - synthetic_datasets["statistical"][col].std()) / data[col].std()
mse += (mean_diff ** 2 + std_diff ** 2)
mse /= (len(numeric_cols) * 2) # Average across columns and metrics
print(f"Statistical similarity score (lower is better): {mse:.4f}")
Complete Example
Here’s a complete example that generates synthetic data and compares distributions:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from secureml.synthetic import generate_synthetic_data
# Create sample data
data = pd.DataFrame({
'name': ['John Smith', 'Jane Doe', 'Robert Johnson', 'Emily Williams',
'Michael Brown', 'Sarah Davis', 'David Miller', 'Lisa Wilson'],
'age': [34, 29, 42, 35, 51, 27, 38, 44],
'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
'email': ['john.smith@example.com', 'jane.doe@example.com',
'robert.j@example.com', 'e.williams@example.com',
'm.brown@example.com', 's.davis@example.com',
'david.m@example.com', 'lisa.wilson@example.com'],
'income': [65000, 72000, 58000, 93000, 81000, 67000, 79000, 82000],
'credit_score': [720, 750, 680, 790, 705, 740, 710, 760],
'zipcode': ['12345', '23456', '34567', '45678', '56789', '67890', '78901', '89012']
})
# Generate synthetic data
synthetic_data = generate_synthetic_data(
template=data,
num_samples=100,
method="statistical",
sensitive_columns=['name', 'email'],
sensitivity_detection={
"auto_detect": True, # Auto-detect additional sensitive columns
"confidence_threshold": 0.7
},
preserve_dtypes=True,
handle_skewness=True,
seed=42
)
# Save the synthetic data
synthetic_data.to_csv("synthetic_customer_data.csv", index=False)
# Compare distributions
numeric_cols = ['age', 'income', 'credit_score']
# Set up the figure
plt.figure(figsize=(15, 5))
# Plot histograms for each numeric column
for i, col in enumerate(numeric_cols):
plt.subplot(1, 3, i+1)
plt.hist(data[col], alpha=0.5, label='Original', bins=10)
plt.hist(synthetic_data[col], alpha=0.5, label='Synthetic', bins=10)
plt.title(f'Distribution of {col}')
plt.legend()
plt.tight_layout()
plt.savefig('synthetic_data_comparison.png')
print("Synthetic data generated and saved to synthetic_customer_data.csv")
print("Distribution comparison saved to synthetic_data_comparison.png")
Best Practices
Choose the right method: - For simple datasets: use “simple” or “statistical” - For complex relationships: use “sdv-copula”, “sdv-ctgan”, or “copula”
Always identify sensitive columns: Either specify them explicitly or use the automatic detection feature.
Set a seed for reproducibility: This ensures you get the same results each time.
Evaluate your synthetic data: Compare the distributions and relationships against the original data.
Balance privacy and utility: Adjust parameters to find the right balance for your use case.
Handle sensitive data carefully: Make sure the synthetic data doesn’t leak any information from the original dataset.