Synthetic Data Generation Examples
==================================

This section demonstrates how to generate synthetic data that preserves the statistical properties of your original data while protecting privacy. SecureML provides multiple methods with different trade-offs between simplicity, statistical accuracy, and privacy protection.

Basic Synthetic Data Generation
-------------------------------

The simplest way to generate synthetic data is to use the ``simple`` method:

.. code-block:: python

    import pandas as pd
    from secureml.synthetic import generate_synthetic_data

    # Sample data with sensitive information
    data = pd.DataFrame({
        'name': ['John Smith', 'Jane Doe', 'Robert Johnson'],
        'age': [34, 29, 42],
        'gender': ['Male', 'Female', 'Male'],
        'email': ['john.smith@example.com', 'jane.doe@example.com', 'robert.j@example.com'],
        'income': [65000, 72000, 58000],
        'credit_score': [720, 750, 680],
        'zipcode': ['12345', '23456', '34567']
    })

    # Generate synthetic data using simple method
    synthetic_data = generate_synthetic_data(
        template=data,
        num_samples=20,
        method="simple",
        sensitive_columns=['name', 'email'],
        seed=42
    )

    print("Synthetic data sample:")
    print(synthetic_data.head())

The ``simple`` method samples values from the observed distributions for non-sensitive columns and uses more sophisticated techniques (such as Faker) for sensitive columns.

Statistical Method
------------------

For better preservation of statistical relationships, use the ``statistical`` method:

.. code-block:: python

    # Generate synthetic data with statistical method
    statistical_synthetic = generate_synthetic_data(
        template=data,
        num_samples=20,
        method="statistical",
        sensitive_columns=['name', 'email'],
        preserve_dtypes=True,
        preserve_outliers=True,
        categorical_threshold=10,
        handle_skewness=True,
        seed=42
    )

    # Compare correlations
    numeric_cols = ['age', 'income', 'credit_score']

    print("Original correlation matrix:")
    print(data[numeric_cols].corr())

    print("\nSynthetic correlation matrix:")
    print(statistical_synthetic[numeric_cols].corr())

The ``statistical`` method preserves:

- Individual column distributions
- Correlations between variables
- Data types and value ranges

Automatic Sensitive Column Detection
------------------------------------

SecureML can automatically detect columns likely to contain sensitive information:

.. code-block:: python

    # Generate synthetic data with automatic sensitive column detection
    auto_synthetic = generate_synthetic_data(
        template=data,
        num_samples=20,
        method="statistical",
        sensitivity_detection={
            "auto_detect": True,
            "confidence_threshold": 0.7,
            "sample_size": 100
        },
        seed=42
    )

Sensitivity detection examines both column names and data content patterns to identify personal identifiers, financial information, health data, and other sensitive categories.

Schema-Based Generation
-----------------------

You can generate synthetic data directly from a schema without an existing dataset:

.. code-block:: python

    # Define a schema for financial customer data
    schema = {
        "columns": {
            "customer_id": "int",
            "age": "int",
            "income": "float",
            "credit_score": "int",
            "education_level": "category",
            "employment_status": "category",
            "has_mortgage": "bool",
            "has_loan": "bool",
            "account_balance": "float"
        }
    }

    # Generate synthetic data from schema
    schema_synthetic = generate_synthetic_data(
        template=schema,
        num_samples=100,
        method="statistical",
        seed=42
    )

Advanced Synthetic Methods
--------------------------

SDV Integration Methods
^^^^^^^^^^^^^^^^^^^^^^^

SecureML integrates with the Synthetic Data Vault (SDV) library for more sophisticated generation methods:

.. code-block:: python

    # SDV's Gaussian Copula method
    try:
        sdv_copula_synthetic = generate_synthetic_data(
            template=data,
            num_samples=100,
            method="sdv-copula",
            sensitive_columns=['name', 'email'],
            anonymize_fields=True,
            seed=42
        )
    except ImportError:
        print("SDV package not installed. Install with: pip install sdv")

    # SDV's CTGAN method (deep learning approach)
    try:
        sdv_ctgan_synthetic = generate_synthetic_data(
            template=data,
            num_samples=100,
            method="sdv-ctgan",
            sensitive_columns=['name', 'email'],
            anonymize_fields=True,
            epochs=300,
            batch_size=32,
            seed=42
        )
    except ImportError:
        print("SDV package not installed. Install with: pip install sdv")

You can also specify constraints on the generated data:

.. code-block:: python

    # Define constraints for SDV methods.
    # These are illustrative and assume the template contains the referenced
    # columns (customer_id, state, city, min_salary, max_salary).
    constraints = [
        {"type": "unique", "columns": ["customer_id"]},
        {"type": "fixed_combinations", "column_names": ["state", "city"]},
        {"type": "inequality", "low_column": "min_salary", "high_column": "max_salary"}
    ]

    # Generate data with constraints
    sdv_synthetic = generate_synthetic_data(
        template=data,
        num_samples=100,
        method="sdv-copula",
        sensitive_columns=['name', 'email'],
        constraints=constraints,
        seed=42
    )

GAN-based Method
^^^^^^^^^^^^^^^^

For more complex distributions, use the GAN-based method:

.. code-block:: python

    # Generate synthetic data using GAN method
    gan_synthetic = generate_synthetic_data(
        template=data,
        num_samples=100,
        method="gan",
        sensitive_columns=['name', 'email'],
        epochs=300,
        batch_size=32,
        generator_dim=[128, 128],
        discriminator_dim=[128, 128],
        learning_rate=0.001,
        noise_dim=100,
        preserve_dtypes=True,
        seed=42
    )

Copula-based Method
^^^^^^^^^^^^^^^^^^^

The copula method captures complex dependencies between variables:

.. code-block:: python

    # Generate synthetic data using copula method
    copula_synthetic = generate_synthetic_data(
        template=data,
        num_samples=100,
        method="copula",
        sensitive_columns=['name', 'email'],
        copula_type="gaussian",
        fit_method="ml",
        preserve_dtypes=True,
        handle_missing="mean",
        categorical_threshold=10,
        handle_skewness=True,
        seed=42
    )

Comparing Methods
-----------------

Different synthetic generation methods have different strengths. The following example generates data with several methods and compares summary statistics:

.. code-block:: python

    import numpy as np

    # Number of samples to generate
    n_samples = 100

    # Generate synthetic data with each method
    methods = ["simple", "statistical", "copula"]
    synthetic_datasets = {}

    for method in methods:
        synthetic_datasets[method] = generate_synthetic_data(
            template=data,
            num_samples=n_samples,
            method=method,
            sensitive_columns=['name', 'email'],
            seed=42
        )

    # Compare means and standard deviations of numeric columns
    numeric_cols = ['age', 'income', 'credit_score']

    # Header row: one column per method
    print(f"{'Column':<15} {'Metric':<10} {'Original':<10}", end="")
    for method in methods:
        print(f" {method.capitalize():<10}", end="")
    print()

    for col in numeric_cols:
        # Mean comparison
        print(f"{col:<15} {'Mean':<10} {data[col].mean():<10.2f}", end="")
        for method in methods:
            synthetic_mean = synthetic_datasets[method][col].mean()
            print(f" {synthetic_mean:<10.2f}", end="")
        print()

        # Std comparison
        print(f"{col:<15} {'Std':<10} {data[col].std():<10.2f}", end="")
        for method in methods:
            synthetic_std = synthetic_datasets[method][col].std()
            print(f" {synthetic_std:<10.2f}", end="")
        print()

Evaluating Synthetic Data Quality
---------------------------------

You can perform simple evaluations to check synthetic data quality:

.. code-block:: python

    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    # Reuses data, numeric_cols, and synthetic_datasets from the previous example

    # Scale the data (fit the scaler on the original data only)
    scaler = StandardScaler()
    original_scaled = scaler.fit_transform(data[numeric_cols])
    synthetic_scaled = scaler.transform(synthetic_datasets["statistical"][numeric_cols])

    # Apply PCA: project both datasets onto the original data's first two
    # principal components (useful for a side-by-side scatter plot)
    pca = PCA(n_components=2)
    original_pca = pca.fit_transform(original_scaled)
    synthetic_pca = pca.transform(synthetic_scaled)

    # Calculate a simple statistical similarity score
    mse = 0
    for col in numeric_cols:
        # Normalized mean difference
        mean_diff = (data[col].mean() - synthetic_datasets["statistical"][col].mean()) / data[col].mean()
        # Normalized std difference
        std_diff = (data[col].std() - synthetic_datasets["statistical"][col].std()) / data[col].std()
        mse += (mean_diff ** 2 + std_diff ** 2)

    mse /= (len(numeric_cols) * 2)  # Average across columns and metrics
    print(f"Statistical similarity score (lower is better): {mse:.4f}")

Complete Example
----------------

Here's a complete example that generates synthetic data and compares distributions:

.. code-block:: python

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from secureml.synthetic import generate_synthetic_data

    # Create sample data
    data = pd.DataFrame({
        'name': ['John Smith', 'Jane Doe', 'Robert Johnson', 'Emily Williams',
                 'Michael Brown', 'Sarah Davis', 'David Miller', 'Lisa Wilson'],
        'age': [34, 29, 42, 35, 51, 27, 38, 44],
        'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
        'email': ['john.smith@example.com', 'jane.doe@example.com', 'robert.j@example.com',
                  'e.williams@example.com', 'm.brown@example.com', 's.davis@example.com',
                  'david.m@example.com', 'lisa.wilson@example.com'],
        'income': [65000, 72000, 58000, 93000, 81000, 67000, 79000, 82000],
        'credit_score': [720, 750, 680, 790, 705, 740, 710, 760],
        'zipcode': ['12345', '23456', '34567', '45678', '56789', '67890', '78901', '89012']
    })

    # Generate synthetic data
    synthetic_data = generate_synthetic_data(
        template=data,
        num_samples=100,
        method="statistical",
        sensitive_columns=['name', 'email'],
        sensitivity_detection={
            "auto_detect": True,  # Auto-detect additional sensitive columns
            "confidence_threshold": 0.7
        },
        preserve_dtypes=True,
        handle_skewness=True,
        seed=42
    )

    # Save the synthetic data
    synthetic_data.to_csv("synthetic_customer_data.csv", index=False)

    # Compare distributions
    numeric_cols = ['age', 'income', 'credit_score']

    # Set up the figure
    plt.figure(figsize=(15, 5))

    # Plot histograms for each numeric column
    for i, col in enumerate(numeric_cols):
        plt.subplot(1, 3, i+1)
        plt.hist(data[col], alpha=0.5, label='Original', bins=10)
        plt.hist(synthetic_data[col], alpha=0.5, label='Synthetic', bins=10)
        plt.title(f'Distribution of {col}')
        plt.legend()

    plt.tight_layout()
    plt.savefig('synthetic_data_comparison.png')

    print("Synthetic data generated and saved to synthetic_customer_data.csv")
    print("Distribution comparison saved to synthetic_data_comparison.png")

Best Practices
--------------

1. **Choose the right method**:

   - For simple datasets: use ``"simple"`` or ``"statistical"``
   - For complex relationships: use ``"sdv-copula"``, ``"sdv-ctgan"``, or ``"copula"``

2. **Always identify sensitive columns**: Either specify them explicitly or use the automatic detection feature.

3. **Set a seed for reproducibility**: This ensures you get the same results each time.

4. **Evaluate your synthetic data**: Compare the distributions and relationships against the original data.

5. **Balance privacy and utility**: Adjust parameters to find the right balance for your use case.

6. **Handle sensitive data carefully**: Make sure the synthetic data doesn't leak any information from the original dataset.
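
One basic way to act on the last point is to check whether any synthetic row reproduces an original row exactly. The sketch below uses only pandas; ``count_exact_matches`` is a hypothetical helper written for illustration, not part of SecureML, and the two small DataFrames are hand-made stand-ins for your original data and generator output. An exact-match count is only a coarse signal, so a result of zero does not by itself show that the synthetic data is free of leakage.

.. code-block:: python

    import pandas as pd


    def count_exact_matches(original, synthetic, columns):
        """Count synthetic rows whose values in `columns` equal some original row.

        Hypothetical helper for illustration; not a SecureML function.
        """
        # Deduplicate the original keys so each synthetic row is counted at most once
        original_keys = original[columns].drop_duplicates()
        # An inner join keeps only the synthetic rows that have an identical original row
        merged = synthetic[columns].merge(original_keys, on=columns, how="inner")
        return len(merged)


    # Hand-made stand-ins for the original data and a generator's output
    original = pd.DataFrame({
        "age": [34, 29, 42],
        "income": [65000, 72000, 58000],
    })
    synthetic = pd.DataFrame({
        "age": [34, 31, 40, 29],
        "income": [65000, 70500, 59000, 71000],
    })

    matches = count_exact_matches(original, synthetic, ["age", "income"])
    print(f"{matches} of {len(synthetic)} synthetic rows copy an original row")

Run the check on the quasi-identifying columns you care about (for example age, income, and zipcode together) rather than on a single column, since single-column matches are expected and usually harmless.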