==============
Synthetic Data
==============

Synthetic data generation creates artificial data that statistically resembles real data while protecting privacy. SecureML provides several techniques to generate high-quality synthetic data that maintains the utility of the original dataset without exposing sensitive information.

Core Concepts
-------------

**Statistical Preservation**: Maintaining the statistical properties of the original data, such as distributions, correlations, and dependencies.

**Privacy Guarantees**: Ensuring that the synthetic data doesn't leak information about specific individuals in the original dataset.

**Utility Preservation**: Ensuring models trained on synthetic data perform similarly to those trained on real data.

Basic Usage
-----------

Generating Synthetic Data
^^^^^^^^^^^^^^^^^^^^^^^^^

The main function for generating synthetic data is ``generate_synthetic_data``:

.. code-block:: python

    from secureml.synthetic import generate_synthetic_data

    # Generate synthetic data
    # Options for method: "simple", "statistical", "sdv-copula",
    # "sdv-ctgan", "sdv-tvae", "gan", "copula"
    synthetic_df = generate_synthetic_data(
        template=original_df,
        num_samples=1000,
        method="statistical",
        sensitive_columns=["name", "email", "ssn", "phone"],
        seed=42
    )

Automatic Detection of Sensitive Columns
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

SecureML can automatically detect sensitive columns in your data:

.. code-block:: python

    from secureml.synthetic import generate_synthetic_data

    # Generate synthetic data with automatic sensitive column detection
    synthetic_df = generate_synthetic_data(
        template=original_df,
        num_samples=1000,
        method="statistical",
        sensitivity_detection={
            "auto_detect": True,
            "confidence_threshold": 0.7,
            "sample_size": 100
        }
    )

Supported Methods
-----------------

SecureML supports several synthetic data generation methods:

Simple Random Sampling
^^^^^^^^^^^^^^^^^^^^^^

Basic method suitable for quick prototyping:

.. code-block:: python

    # Generate synthetic data using simple method
    synthetic_df = generate_synthetic_data(
        template=original_df,
        num_samples=1000,
        method="simple",
        sensitive_columns=["name", "email", "ssn"]
    )

Statistical Method
^^^^^^^^^^^^^^^^^^

More sophisticated method that preserves statistical relationships:

.. code-block:: python

    # Generate synthetic data using statistical method
    synthetic_df = generate_synthetic_data(
        template=original_df,
        num_samples=1000,
        method="statistical",
        sensitive_columns=["name", "email", "ssn"],
        preserve_dtypes=True,
        preserve_outliers=True,
        categorical_threshold=20,
        handle_skewness=True,
        seed=42
    )

SDV Integration Methods
^^^^^^^^^^^^^^^^^^^^^^^

Integration with the Synthetic Data Vault (SDV) library for advanced generation (requires SDV to be installed):

.. code-block:: python

    # Generate synthetic data using SDV's Gaussian Copula
    synthetic_df = generate_synthetic_data(
        template=original_df,
        num_samples=1000,
        method="sdv-copula",
        sensitive_columns=["name", "email", "ssn"],
        anonymize_fields=True
    )

    # Generate synthetic data using SDV's CTGAN
    synthetic_df = generate_synthetic_data(
        template=original_df,
        num_samples=1000,
        method="sdv-ctgan",
        sensitive_columns=["name", "email", "ssn"],
        anonymize_fields=True,
        epochs=300,
        batch_size=500
    )

    # Generate synthetic data using SDV's TVAE
    synthetic_df = generate_synthetic_data(
        template=original_df,
        num_samples=1000,
        method="sdv-tvae",
        sensitive_columns=["name", "email", "ssn"],
        anonymize_fields=True
    )

GAN-based Method
^^^^^^^^^^^^^^^^

Generative Adversarial Network approach (without requiring SDV):

.. code-block:: python

    # Generate synthetic data using GAN method
    synthetic_df = generate_synthetic_data(
        template=original_df,
        num_samples=1000,
        method="gan",
        sensitive_columns=["name", "email", "ssn"],
        epochs=300,
        batch_size=32,
        generator_dim=[128, 128],
        discriminator_dim=[128, 128],
        learning_rate=0.001,
        noise_dim=100,
        preserve_dtypes=True
    )

Copula-based Method
^^^^^^^^^^^^^^^^^^^

Copula method for capturing variable dependencies:

.. code-block:: python

    # Generate synthetic data using copula method
    synthetic_df = generate_synthetic_data(
        template=original_df,
        num_samples=1000,
        method="copula",
        sensitive_columns=["name", "email", "ssn"],
        copula_type="gaussian",
        fit_method="ml",
        preserve_dtypes=True,
        handle_missing="mean",
        categorical_threshold=20,
        handle_skewness=True,
        seed=42
    )

Providing Data Schema Instead of Template DataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can generate synthetic data using a schema definition instead of an actual DataFrame:

.. code-block:: python

    # Define a schema
    schema = {
        "columns": {
            "age": "int",
            "income": "float",
            "gender": "category",
            "education": "category"
        }
    }

    # Generate synthetic data from schema
    synthetic_df = generate_synthetic_data(
        template=schema,
        num_samples=1000,
        method="statistical"
    )

Advanced Usage
--------------

SDV Constraints
^^^^^^^^^^^^^^^

When using SDV methods, you can specify constraints on the generated data:

.. code-block:: python

    # Define constraints for SDV methods
    constraints = [
        {"type": "unique", "columns": ["id"]},
        {"type": "fixed_combinations", "column_names": ["state", "city"]},
        {"type": "inequality", "low_column": "min_salary", "high_column": "max_salary"}
    ]

    # Generate synthetic data with constraints
    synthetic_df = generate_synthetic_data(
        template=original_df,
        num_samples=1000,
        method="sdv-copula",
        constraints=constraints
    )

Handling Sensitive Information
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The function automatically generates realistic but fake data for sensitive columns:

.. code-block:: python

    # Generate synthetic data with sensitive column handling
    synthetic_df = generate_synthetic_data(
        template=original_df,
        num_samples=1000,
        method="statistical",
        sensitive_columns=["name", "email", "phone", "ssn", "credit_card"]
    )

Best Practices
--------------

1. **Choose the right method**: Select the generation method based on your data characteristics:

   - For simple datasets with low complexity: "simple"
   - For general-purpose generation with good statistical properties: "statistical"
   - For complex tabular data with mixed types: "sdv-ctgan" or "sdv-tvae"
   - For data with important correlations: "sdv-copula" or "copula"

2. **Automatic sensitive column detection**: When in doubt about which columns are sensitive, use the automatic detection feature.

3. **Seed for reproducibility**: Always set a seed when you need reproducible results.

4. **Evaluate your synthetic data**: Check that the synthetic data preserves important statistical properties while providing sufficient privacy protection.

5. **Balance privacy and utility**: Adjust parameters to find the right balance between privacy protection and synthetic data utility.

Example Workflow
----------------

Complete workflow for generating and checking synthetic data:

.. code-block:: python

    import pandas as pd
    from secureml.synthetic import generate_synthetic_data

    # Load original data
    original_df = pd.read_csv("customer_data.csv")

    # Generate synthetic data with automatic sensitive column detection
    synthetic_df = generate_synthetic_data(
        template=original_df,
        num_samples=len(original_df),
        method="statistical",
        sensitivity_detection={"auto_detect": True, "confidence_threshold": 0.7},
        seed=42,
        preserve_dtypes=True,
        handle_skewness=True
    )

    # Save synthetic data
    synthetic_df.to_csv("synthetic_customer_data.csv", index=False)

    # Basic validation - check column distributions
    for col in original_df.select_dtypes(include=['number']).columns:
        print(f"Column: {col}")
        print(f"Original mean: {original_df[col].mean()}, std: {original_df[col].std()}")
        print(f"Synthetic mean: {synthetic_df[col].mean()}, std: {synthetic_df[col].std()}")
        print()

Further Reading
---------------

* :doc:`/api/synthetic_data` - Complete API reference for synthetic data functions
* :doc:`/examples/synthetic_data` - More examples of synthetic data generation techniques
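
Extended Validation Sketch
--------------------------

The basic validation in the example workflow compares means and standard deviations column by column. The sketch below extends that idea to a tabular report and a correlation check, which relates to the statistical preservation goal described under Core Concepts. It uses only pandas and NumPy. The helper names ``compare_numeric_columns`` and ``max_correlation_gap`` are hypothetical and are not part of SecureML, and the demo builds a resampled stand-in frame rather than calling ``generate_synthetic_data``, so treat it as an illustration under those assumptions rather than the library's own evaluation method.

.. code-block:: python

    import numpy as np
    import pandas as pd


    def _shared_numeric_columns(original_df, synthetic_df):
        """Columns that are numeric in both frames (hypothetical helper)."""
        return [
            col
            for col in original_df.select_dtypes(include=["number"]).columns
            if col in synthetic_df.columns
            and pd.api.types.is_numeric_dtype(synthetic_df[col])
        ]


    def compare_numeric_columns(original_df, synthetic_df):
        """Tabulate mean and std for both frames with absolute differences."""
        cols = _shared_numeric_columns(original_df, synthetic_df)
        report = pd.DataFrame({
            "original_mean": original_df[cols].mean(),
            "synthetic_mean": synthetic_df[cols].mean(),
            "original_std": original_df[cols].std(),
            "synthetic_std": synthetic_df[cols].std(),
        })
        report["mean_abs_diff"] = (
            report["original_mean"] - report["synthetic_mean"]
        ).abs()
        report["std_abs_diff"] = (
            report["original_std"] - report["synthetic_std"]
        ).abs()
        return report


    def max_correlation_gap(original_df, synthetic_df):
        """Largest absolute difference between pairwise Pearson correlations."""
        cols = _shared_numeric_columns(original_df, synthetic_df)
        gap = (original_df[cols].corr() - synthetic_df[cols].corr()).abs()
        return float(gap.max().max())


    # Demo data: a stand-in for real and generated frames, for illustration only.
    rng = np.random.default_rng(0)
    original_df = pd.DataFrame({
        "age": rng.normal(40, 10, 500),
        "income": rng.normal(50000, 8000, 500),
        "segment": rng.choice(["a", "b"], 500),  # non-numeric, skipped below
    })
    # Bootstrap resample used in place of generate_synthetic_data output.
    synthetic_df = original_df.sample(
        frac=1.0, replace=True, random_state=0
    ).reset_index(drop=True)

    report = compare_numeric_columns(original_df, synthetic_df)
    gap = max_correlation_gap(original_df, synthetic_df)
    print(report)
    print(f"Largest correlation gap: {gap:.3f}")

Small differences in this report suggest that marginal statistics and linear correlations are preserved. The report says nothing about privacy protection, which needs to be assessed separately.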