Synthetic Data Generation Examples

This section demonstrates how to generate synthetic data that preserves the statistical properties of your original data while ensuring privacy. SecureML provides multiple methods with different trade-offs between simplicity, statistical accuracy, and privacy protection.

Basic Synthetic Data Generation

The simplest way to generate synthetic data is to use the simple method:

import pandas as pd
from secureml.synthetic import generate_synthetic_data

# Sample data with sensitive information
data = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Robert Johnson'],
    'age': [34, 29, 42],
    'gender': ['Male', 'Female', 'Male'],
    'email': ['john.smith@example.com', 'jane.doe@example.com', 'robert.j@example.com'],
    'income': [65000, 72000, 58000],
    'credit_score': [720, 750, 680],
    'zipcode': ['12345', '23456', '34567']
})

# Generate synthetic data using simple method
synthetic_data = generate_synthetic_data(
    template=data,
    num_samples=20,
    method="simple",
    sensitive_columns=['name', 'email'],
    seed=42
)

print("Synthetic data sample:")
print(synthetic_data.head())

The simple method samples values for non-sensitive columns from their observed distributions and generates values for sensitive columns with dedicated techniques (such as Faker).
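To make the idea concrete, here is a minimal sketch of per-column sampling in plain Python. It is not SecureML's implementation: the function name simple_synthesize and the placeholder format for sensitive values are assumptions made for illustration, and a real pipeline might call a library such as Faker where the placeholder is built.

```python
import random


def simple_synthesize(rows, num_samples, sensitive_columns, seed=None):
    """Hypothetical helper: resample each non-sensitive column independently."""
    rng = random.Random(seed)
    columns = list(rows[0].keys())
    # Collect the observed values of every column once.
    observed = {col: [r[col] for r in rows] for col in columns}
    synthetic = []
    for i in range(num_samples):
        record = {}
        for col in columns:
            if col in sensitive_columns:
                # Placeholder value (assumption); never copied from the original rows.
                record[col] = f"{col}_{i:04d}"
            else:
                # Draw from this column's observed values, ignoring other columns.
                record[col] = rng.choice(observed[col])
        synthetic.append(record)
    return synthetic


rows = [
    {"name": "John Smith", "age": 34, "income": 65000},
    {"name": "Jane Doe", "age": 29, "income": 72000},
    {"name": "Robert Johnson", "age": 42, "income": 58000},
]
sample = simple_synthesize(rows, num_samples=5, sensitive_columns={"name"}, seed=42)
for record in sample:
    print(record)
```

Because each column is drawn independently, this approach keeps per-column value frequencies but does not keep relationships between columns.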

Statistical Method

For better preservation of statistical relationships, use the statistical method:

# Generate synthetic data with statistical method
statistical_synthetic = generate_synthetic_data(
    template=data,
    num_samples=20,
    method="statistical",
    sensitive_columns=['name', 'email'],
    preserve_dtypes=True,
    preserve_outliers=True,
    categorical_threshold=10,
    handle_skewness=True,
    seed=42
)

# Compare correlations
numeric_cols = ['age', 'income', 'credit_score']
print("Original correlation matrix:")
print(data[numeric_cols].corr())
print("\nSynthetic correlation matrix:")
print(statistical_synthetic[numeric_cols].corr())

The statistical method preserves:

- Individual column distributions
- Correlations between variables
- Data types and value ranges
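The following sketch illustrates why preserving correlations requires more than per-column sampling. It uses made-up age and income values (an assumption, not data from this guide) and plain Python: resampling each column independently keeps the marginal values but pushes the correlation toward zero.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)

# Made-up data in which income rises with age, plus some noise.
ages = [rng.uniform(20, 65) for _ in range(2000)]
incomes = [1000 * age + rng.gauss(0, 3000) for age in ages]

# Resample each column on its own, as a purely per-column method would.
ages_independent = [rng.choice(ages) for _ in ages]
incomes_independent = [rng.choice(incomes) for _ in incomes]

original_r = pearson(ages, incomes)
independent_r = pearson(ages_independent, incomes_independent)

print(f"Correlation in the original data:       {original_r:.3f}")
print(f"Correlation after independent sampling: {independent_r:.3f}")
```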

Automatic Sensitive Column Detection

SecureML can automatically detect columns likely to contain sensitive information:

# Generate synthetic data with automatic sensitive column detection
auto_synthetic = generate_synthetic_data(
    template=data,
    num_samples=20,
    method="statistical",
    sensitivity_detection={
        "auto_detect": True,
        "confidence_threshold": 0.7,
        "sample_size": 100
    },
    seed=42
)

The sensitivity detection looks at both column names and data content patterns to identify personal identifiers, financial information, health data, and other sensitive categories.
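As a rough illustration of combining name-based and content-based signals, consider the sketch below. The hint list, the fixed 0.8 name score, and the function detect_sensitive are all hypothetical; the actual detection logic and scoring in SecureML are not described here and may differ.

```python
import re

# Hypothetical name hints, chosen only for illustration.
NAME_HINTS = ("name", "email", "phone", "ssn", "income", "salary", "credit")
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def detect_sensitive(columns, confidence_threshold=0.7, sample_size=100):
    """Return {column: score} for columns whose score meets the threshold."""
    flagged = {}
    for name, values in columns.items():
        # Signal 1: the column name contains a known hint (assumed score of 0.8).
        name_score = 0.8 if any(hint in name.lower() for hint in NAME_HINTS) else 0.0
        # Signal 2: share of sampled values that look like email addresses.
        sample = [str(v) for v in values[:sample_size]]
        if sample:
            content_score = sum(1 for v in sample if EMAIL_PATTERN.match(v)) / len(sample)
        else:
            content_score = 0.0
        score = max(name_score, content_score)
        if score >= confidence_threshold:
            flagged[name] = score
    return flagged


columns = {
    "name": ["John Smith", "Jane Doe", "Robert Johnson"],
    "age": [34, 29, 42],
    "contact": ["john.smith@example.com", "jane.doe@example.com", "robert.j@example.com"],
    "income": [65000, 72000, 58000],
    "zipcode": ["12345", "23456", "34567"],
}
flagged = detect_sensitive(columns)
print(flagged)
```

In this sketch the contact column is flagged purely from its contents, even though its name carries no hint, which is the reason to inspect data patterns as well as column names.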

Schema-Based Generation

You can generate synthetic data directly from a schema without an existing dataset:

# Define a schema for financial customer data
schema = {
    "columns": {
        "customer_id": "int",
        "age": "int",
        "income": "float",
        "credit_score": "int",
        "education_level": "category",
        "employment_status": "category",
        "has_mortgage": "bool",
        "has_loan": "bool",
        "account_balance": "float"
    }
}

# Generate synthetic data from schema
schema_synthetic = generate_synthetic_data(
    template=schema,
    num_samples=100,
    method="statistical",
    seed=42
)
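Without a template dataset there are no observed distributions to learn from, so values must come from defaults or from ranges declared elsewhere. The sketch below shows the general idea with type-driven generators in plain Python. The value ranges, category labels, and the function generate_from_schema are assumptions for illustration; how SecureML picks ranges and categories is not stated in this guide.

```python
import random


def generate_from_schema(schema, num_samples, seed=None):
    """Hypothetical helper: build rows from a {"columns": {name: type}} schema."""
    rng = random.Random(seed)
    # Assumed default generators for each declared type.
    generators = {
        "int": lambda: rng.randint(0, 100),
        "float": lambda: round(rng.uniform(0.0, 1000.0), 2),
        "bool": lambda: rng.random() < 0.5,
        "category": lambda: rng.choice(["A", "B", "C"]),
    }
    rows = []
    for _ in range(num_samples):
        row = {col: generators[dtype]() for col, dtype in schema["columns"].items()}
        rows.append(row)
    return rows


demo_schema = {
    "columns": {
        "age": "int",
        "income": "float",
        "education_level": "category",
        "has_mortgage": "bool",
    }
}
demo_rows = generate_from_schema(demo_schema, num_samples=10, seed=42)
for row in demo_rows[:3]:
    print(row)
```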

Advanced Synthetic Methods

SDV Integration Methods

SecureML integrates with the Synthetic Data Vault (SDV) library for more sophisticated generation methods:

# SDV's Gaussian Copula method
try:
    sdv_copula_synthetic = generate_synthetic_data(
        template=data,
        num_samples=100,
        method="sdv-copula",
        sensitive_columns=['name', 'email'],
        anonymize_fields=True,
        seed=42
    )
except ImportError:
    print("SDV package not installed. Install with: pip install sdv")

# SDV's CTGAN method (deep learning approach)
try:
    sdv_ctgan_synthetic = generate_synthetic_data(
        template=data,
        num_samples=100,
        method="sdv-ctgan",
        sensitive_columns=['name', 'email'],
        anonymize_fields=True,
        epochs=300,
        batch_size=32,
        seed=42
    )
except ImportError:
    print("SDV package not installed. Install with: pip install sdv")
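As an alternative to catching ImportError around every call, you can check once whether the optional dependency is present and choose a method accordingly. This is a small sketch using the standard library; falling back to the built-in "copula" method is a design choice assumed here, not documented SecureML behavior.

```python
import importlib.util


def sdv_available():
    """Return True if the optional sdv package can be imported."""
    return importlib.util.find_spec("sdv") is not None


# Prefer the SDV-backed method when SDV is installed, otherwise fall back
# to the built-in copula method (an assumed fallback for this sketch).
method = "sdv-copula" if sdv_available() else "copula"
print(f"Using synthetic data method: {method}")
```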

You can also specify constraints on the generated data:

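# Define constraints for SDV methods.
# Note: the column names below (customer_id, state, city, min_salary,
# max_salary) are illustrative and do not appear in the sample `data`
# defined earlier. Use a template that contains the constrained columns.
constraints = [
    {"type": "unique", "columns": ["customer_id"]},
    {"type": "fixed_combinations", "column_names": ["state", "city"]},
    {"type": "inequality", "low_column": "min_salary", "high_column": "max_salary"}
]

# Generate data with constraints
try:
    sdv_synthetic = generate_synthetic_data(
        template=data,  # replace with a DataFrame that has the columns above
        num_samples=100,
        method="sdv-copula",
        sensitive_columns=['name', 'email'],
        constraints=constraints,
        seed=42
    )
except ImportError:
    print("SDV package not installed. Install with: pip install sdv")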
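After generating constrained data, it can be worth verifying that the output actually satisfies the constraints. The sketch below checks the three constraint types used in this section against rows held as plain dictionaries. The function check_constraints and the sample rows are hypothetical, and the inequality is treated as non-strict (low <= high), which is an assumption.

```python
def check_constraints(rows, constraints, reference_rows):
    """Hypothetical helper: return a list of human-readable constraint violations."""
    violations = []
    for constraint in constraints:
        if constraint["type"] == "unique":
            for col in constraint["columns"]:
                values = [r[col] for r in rows]
                if len(values) != len(set(values)):
                    violations.append(f"unique violated on {col}")
        elif constraint["type"] == "fixed_combinations":
            cols = constraint["column_names"]
            # Only combinations seen in the reference data are allowed.
            allowed = {tuple(r[c] for c in cols) for r in reference_rows}
            bad = sum(1 for r in rows if tuple(r[c] for c in cols) not in allowed)
            if bad:
                violations.append(f"fixed_combinations violated in {bad} row(s): {cols}")
        elif constraint["type"] == "inequality":
            low, high = constraint["low_column"], constraint["high_column"]
            bad = sum(1 for r in rows if r[low] > r[high])
            if bad:
                violations.append(f"inequality violated in {bad} row(s): {low} > {high}")
    return violations


demo_constraints = [
    {"type": "unique", "columns": ["customer_id"]},
    {"type": "fixed_combinations", "column_names": ["state", "city"]},
    {"type": "inequality", "low_column": "min_salary", "high_column": "max_salary"},
]
reference = [
    {"customer_id": 1, "state": "CA", "city": "Los Angeles", "min_salary": 40000, "max_salary": 60000},
    {"customer_id": 2, "state": "NY", "city": "New York", "min_salary": 50000, "max_salary": 80000},
]
generated = [
    {"customer_id": 10, "state": "CA", "city": "Los Angeles", "min_salary": 45000, "max_salary": 55000},
    {"customer_id": 10, "state": "NY", "city": "Los Angeles", "min_salary": 70000, "max_salary": 50000},
]
violations = check_constraints(generated, demo_constraints, reference)
for violation in violations:
    print(violation)
```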

GAN-based Method

For more complex distributions, use the GAN-based method:

# Generate synthetic data using GAN method
gan_synthetic = generate_synthetic_data(
    template=data,
    num_samples=100,
    method="gan",
    sensitive_columns=['name', 'email'],
    epochs=300,
    batch_size=32,
    generator_dim=[128, 128],
    discriminator_dim=[128, 128],
    learning_rate=0.001,
    noise_dim=100,
    preserve_dtypes=True,
    seed=42
)

Copula-based Method

The copula method captures complex dependencies between variables:

# Generate synthetic data using copula method
copula_synthetic = generate_synthetic_data(
    template=data,
    num_samples=100,
    method="copula",
    sensitive_columns=['name', 'email'],
    copula_type="gaussian",
    fit_method="ml",
    preserve_dtypes=True,
    handle_missing="mean",
    categorical_threshold=10,
    handle_skewness=True,
    seed=42
)
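To show what a Gaussian copula does, here is a two-variable sketch in plain Python. It is a simplified illustration and not SecureML's implementation: it converts each column to normal scores via ranks, measures the correlation of those scores, samples correlated normal pairs, and maps them back through each column's empirical quantiles. All function names are hypothetical, and because no interpolation is used, generated values are limited to values observed in the input.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_scores(values):
    """Map values to standard-normal scores via their ranks (ties broken by position)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scores[idx] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def empirical_quantile(sorted_values, u):
    """Return the observed value at cumulative probability u (no interpolation)."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def gaussian_copula_sample(xs, ys, num_samples, seed=None):
    """Sample (x, y) pairs whose dependence follows a fitted Gaussian copula."""
    rng = random.Random(seed)
    rho = pearson(normal_scores(xs), normal_scores(ys))
    # Guard against tiny negative values caused by floating-point rounding.
    residual_scale = max(0.0, 1.0 - rho ** 2) ** 0.5
    sorted_x, sorted_y = sorted(xs), sorted(ys)
    samples = []
    for _ in range(num_samples):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + residual_scale * rng.gauss(0, 1)
        samples.append((
            empirical_quantile(sorted_x, STD_NORMAL.cdf(z1)),
            empirical_quantile(sorted_y, STD_NORMAL.cdf(z2)),
        ))
    return samples


ages = [34, 29, 42, 35, 51, 27, 38, 44]
incomes = [65000, 72000, 58000, 93000, 81000, 67000, 79000, 82000]
synthetic_pairs = gaussian_copula_sample(ages, incomes, num_samples=5, seed=42)
for age, income in synthetic_pairs:
    print(age, income)
```

The key design point is the separation of concerns: the marginal distribution of each column is handled by the quantile mapping, while the dependence between columns is carried only by the correlation of the normal scores.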

Comparing Methods

Different synthetic generation methods have different strengths. Here’s a comparison:

import numpy as np

# Number of samples to generate
n_samples = 100

# Generate synthetic data with each method
methods = ["simple", "statistical", "copula"]
synthetic_datasets = {}

for method in methods:
    synthetic_datasets[method] = generate_synthetic_data(
        template=data,
        num_samples=n_samples,
        method=method,
        sensitive_columns=['name', 'email'],
        seed=42
    )

# Compare means and standard deviations of numeric columns
numeric_cols = ['age', 'income', 'credit_score']

print(f"{'Column':<15} {'Metric':<10} {'Original':<10}", end="")
for method in methods:
    print(f" {method.capitalize():<10}", end="")
print()

for col in numeric_cols:
    # Mean comparison
    print(f"{col:<15} {'Mean':<10} {data[col].mean():<10.2f}", end="")
    for method in methods:
        synthetic_mean = synthetic_datasets[method][col].mean()
        print(f" {synthetic_mean:<10.2f}", end="")
    print()

    # Std comparison
    print(f"{col:<15} {'Std':<10} {data[col].std():<10.2f}", end="")
    for method in methods:
        synthetic_std = synthetic_datasets[method][col].std()
        print(f" {synthetic_std:<10.2f}", end="")
    print()

Evaluating Synthetic Data Quality

You can perform simple evaluations to check synthetic data quality:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

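# Scale both datasets with a scaler fitted on the original data only
scaler = StandardScaler()
original_scaled = scaler.fit_transform(data[numeric_cols])
synthetic_scaled = scaler.transform(synthetic_datasets["statistical"][numeric_cols])

# Project both datasets onto the first two principal components of the original data.
# The projections are not used in the score below; plot them (for example, as a
# scatter plot) to check visually how well the two datasets overlap.
pca = PCA(n_components=2)
original_pca = pca.fit_transform(original_scaled)
synthetic_pca = pca.transform(synthetic_scaled)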

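# Calculate a simple similarity score: the mean squared relative difference
# in each column's mean and standard deviation (assumes both are nonzero)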
mse = 0
for col in numeric_cols:
    # Normalized mean difference
    mean_diff = (data[col].mean() - synthetic_datasets["statistical"][col].mean()) / data[col].mean()
    # Normalized std difference
    std_diff = (data[col].std() - synthetic_datasets["statistical"][col].std()) / data[col].std()
    mse += (mean_diff ** 2 + std_diff ** 2)
mse /= (len(numeric_cols) * 2)  # Average across columns and metrics

print(f"Statistical similarity score (lower is better): {mse:.4f}")
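Means and standard deviations can match even when the shapes of two distributions differ. A complementary check is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distribution functions. The sketch below computes it in plain Python on made-up values; scipy.stats.ks_2samp provides the same statistic together with a p-value if SciPy is available.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    # The largest gap always occurs at one of the observed values.
    for value in set(a) | set(b):
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Made-up values for illustration: 0.0 means identical empirical
# distributions, 1.0 means the samples do not overlap at all.
original_ages = [34, 29, 42, 35, 51, 27, 38, 44]
synthetic_ages = [33, 30, 41, 36, 49, 28, 39, 45, 37, 31]
print(f"KS statistic for age: {ks_statistic(original_ages, synthetic_ages):.3f}")
```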

Complete Example

Here’s a complete example that generates synthetic data and compares distributions:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from secureml.synthetic import generate_synthetic_data

# Create sample data
data = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Robert Johnson', 'Emily Williams',
            'Michael Brown', 'Sarah Davis', 'David Miller', 'Lisa Wilson'],
    'age': [34, 29, 42, 35, 51, 27, 38, 44],
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'email': ['john.smith@example.com', 'jane.doe@example.com',
            'robert.j@example.com', 'e.williams@example.com',
            'm.brown@example.com', 's.davis@example.com',
            'david.m@example.com', 'lisa.wilson@example.com'],
    'income': [65000, 72000, 58000, 93000, 81000, 67000, 79000, 82000],
    'credit_score': [720, 750, 680, 790, 705, 740, 710, 760],
    'zipcode': ['12345', '23456', '34567', '45678', '56789', '67890', '78901', '89012']
})

# Generate synthetic data
synthetic_data = generate_synthetic_data(
    template=data,
    num_samples=100,
    method="statistical",
    sensitive_columns=['name', 'email'],
    sensitivity_detection={
        "auto_detect": True,  # Auto-detect additional sensitive columns
        "confidence_threshold": 0.7
    },
    preserve_dtypes=True,
    handle_skewness=True,
    seed=42
)

# Save the synthetic data
synthetic_data.to_csv("synthetic_customer_data.csv", index=False)

# Compare distributions
numeric_cols = ['age', 'income', 'credit_score']

# Set up the figure
plt.figure(figsize=(15, 5))

# Plot histograms for each numeric column
for i, col in enumerate(numeric_cols):
    plt.subplot(1, 3, i+1)
    plt.hist(data[col], alpha=0.5, label='Original', bins=10)
    plt.hist(synthetic_data[col], alpha=0.5, label='Synthetic', bins=10)
    plt.title(f'Distribution of {col}')
    plt.legend()

plt.tight_layout()
plt.savefig('synthetic_data_comparison.png')

print("Synthetic data generated and saved to synthetic_customer_data.csv")
print("Distribution comparison saved to synthetic_data_comparison.png")

Best Practices

- Choose the right method:
  - For simple datasets: use “simple” or “statistical”
  - For complex relationships: use “sdv-copula”, “sdv-ctgan”, or “copula”
- Always identify sensitive columns: Either specify them explicitly or use the automatic detection feature.
- Set a seed for reproducibility: This ensures you get the same results each time.
- Evaluate your synthetic data: Compare the distributions and relationships against the original data.
- Balance privacy and utility: Adjust parameters to find the right balance for your use case.
- Handle sensitive data carefully: Check that the synthetic data does not reproduce records or identifying values from the original dataset.
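One basic leakage check is to measure how many synthetic rows exactly reproduce an original row on a chosen set of columns. The sketch below does this in plain Python on made-up rows; the function exact_match_rate is hypothetical, and a low rate is only a necessary condition, not proof that no information leaks.

```python
def exact_match_rate(original_rows, synthetic_rows, columns):
    """Share of synthetic rows that exactly match an original row on `columns`."""
    original_keys = {tuple(r[c] for c in columns) for r in original_rows}
    matches = sum(
        1 for r in synthetic_rows if tuple(r[c] for c in columns) in original_keys
    )
    return matches / len(synthetic_rows)


original_rows = [
    {"age": 34, "income": 65000, "zipcode": "12345"},
    {"age": 29, "income": 72000, "zipcode": "23456"},
]
synthetic_rows = [
    {"age": 34, "income": 65000, "zipcode": "12345"},  # exact copy of an original row
    {"age": 31, "income": 70500, "zipcode": "23456"},
    {"age": 40, "income": 61000, "zipcode": "34567"},
    {"age": 29, "income": 68000, "zipcode": "12345"},
]
rate = exact_match_rate(original_rows, synthetic_rows, ["age", "income", "zipcode"])
print(f"Exact-match rate: {rate:.2%}")
```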