================= Synthetic Data API ================= .. module:: secureml.synthetic This module provides tools to create synthetic datasets that mimic the statistical properties of real data without containing sensitive information, enabling privacy-preserving machine learning development and testing. Main Functions -------------- .. autofunction:: generate_synthetic_data The main function for generating synthetic data from a template dataset: .. code-block:: python from secureml.synthetic import generate_synthetic_data import pandas as pd # Load a real dataset as a template template_data = pd.read_csv("patient_data.csv") # Generate 1000 synthetic samples using the statistical method synthetic_data = generate_synthetic_data( template=template_data, num_samples=1000, method="statistical", sensitive_columns=["name", "email", "ssn", "address"], seed=42 # For reproducibility ) # Save the synthetic dataset synthetic_data.to_csv("synthetic_patient_data.csv", index=False) Supported Generation Methods --------------------------- The module supports several methods for generating synthetic data, each with different characteristics and trade-offs: 1. **Simple Method** (``method="simple"``) A basic approach that preserves individual column distributions but not complex relationships between variables. .. code-block:: python # Generate simple synthetic data simple_synthetic = generate_synthetic_data( template=data, num_samples=500, method="simple" ) 2. **Statistical Method** (``method="statistical"``) A more sophisticated approach that preserves correlations and statistical properties. .. code-block:: python # Generate synthetic data preserving statistical properties statistical_synthetic = generate_synthetic_data( template=data, num_samples=500, method="statistical", handle_skewness=True, preserve_outliers=False ) 3. **SDV Methods** (``method="sdv-copula"``, ``method="sdv-ctgan"``, ``method="sdv-tvae"``) Leverages the Synthetic Data Vault library for advanced generative models: .. code-block:: python # Generate synthetic data using a Gaussian Copula model copula_synthetic = generate_synthetic_data( template=data, num_samples=500, method="sdv-copula", constraints=[ {"type": "unique", "columns": ["id"]} ] ) # Generate synthetic data using a GAN-based model ctgan_synthetic = generate_synthetic_data( template=data, num_samples=500, method="sdv-ctgan", epochs=100 ) 4. **Direct Methods** (``method="gan"``, ``method="copula"``) Direct implementations of generative adversarial networks and copulas: .. code-block:: python # Generate synthetic data using a GAN model gan_synthetic = generate_synthetic_data( template=data, num_samples=500, method="gan", epochs=300, batch_size=32 ) Sensitive Data Handling ---------------------- The module provides automatic detection and special handling for sensitive columns: .. code-block:: python # Automatic detection of sensitive columns synthetic_data = generate_synthetic_data( template=data, num_samples=1000, method="statistical", sensitivity_detection={ "auto_detect": True, "confidence_threshold": 0.7, "sample_size": 100 } ) # Explicitly specify sensitive columns synthetic_data = generate_synthetic_data( template=data, num_samples=1000, method="statistical", sensitive_columns=["email", "phone", "ssn", "medical_record"] ) Creating Synthetic Data from Schema --------------------------------- You can also generate synthetic data from a schema definition without a template dataset: .. code-block:: python # Define a schema schema = { "columns": { "age": "int", "income": "float", "education": "category", "marital_status": "category" } } # Generate synthetic data from schema synthetic_from_schema = generate_synthetic_data( template=schema, num_samples=1000, method="statistical" ) Using Constraints with SDV Methods -------------------------------- When using SDV-based methods, you can apply constraints to the synthetic data: .. code-block:: python # Generate synthetic data with constraints synthetic_with_constraints = generate_synthetic_data( template=data, num_samples=500, method="sdv-copula", constraints=[ {"type": "unique", "columns": ["id"]}, {"type": "fixed_combinations", "column_names": ["city", "state"]}, {"type": "inequality", "low_column": "start_date", "high_column": "end_date"} ] ) Helper Functions -------------- The module also provides several helper functions that are used internally: - ``_identify_sensitive_columns``: Automatically identifies columns containing sensitive data - ``_generate_simple_synthetic``: Implements the simple generation method - ``_generate_statistical_synthetic``: Implements the statistical generation method - ``_generate_sdv_synthetic``: Implements SDV-based generation methods - ``_generate_gan_synthetic``: Implements GAN-based generation - ``_generate_copula_synthetic``: Implements copula-based generation Best Practices ------------- 1. **Start Simple**: Begin with simpler methods like "simple" or "statistical" before trying more complex models 2. **Evaluate Quality**: Compare synthetic data distributions with the original data 3. **Handle Sensitive Data**: Always specify sensitive columns or enable auto-detection 4. **Set Seed**: Use the seed parameter for reproducible results 5. **Balance Privacy and Utility**: More complex methods may preserve utility better but might have privacy implications 6. **Constraints Matter**: Use constraints with SDV methods to ensure business rules are preserved