CLI Examples
This section demonstrates how to use the SecureML command-line interface through practical examples. You can use these examples as a starting point for your own privacy-preserving data workflows.
Basic Setup
To use the CLI, make sure you have SecureML installed:
pip install secureml
Getting help information:
# Show general help
secureml --help
# Show help for a specific command
secureml anonymization --help
# Show version information
secureml --version
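Before running the examples in a script, it can help to confirm that the secureml executable is actually on your PATH. The following is a minimal sketch; the helper name require_cmd is illustrative and is not part of SecureML.

```shell
# require_cmd is a hypothetical helper, not part of SecureML.
# It returns 0 if the named command is on PATH, and 1 with a message otherwise.
require_cmd() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "error: '$1' not found on PATH" >&2
        return 1
    fi
}

# Report whether the CLI is available without aborting the shell.
if require_cmd secureml; then
    echo "secureml is available"
else
    echo "secureml is missing; install it with: pip install secureml"
fi
```

In a longer script you might call `require_cmd secureml || exit 1` once at the top instead of checking before every command.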
Anonymization Examples
Applying k-anonymity to protect sensitive data:
# Basic k-anonymity with k=3
secureml anonymization k-anonymize patient_data.csv anonymized_data.csv \
--quasi-id age --quasi-id zipcode \
--sensitive diagnosis --sensitive income \
--k 3
# Using a different output format
secureml anonymization k-anonymize patient_data.csv anonymized_data.json \
--quasi-id age --quasi-id zipcode \
--sensitive diagnosis \
--k 2 \
--format json
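After anonymizing, you may want to sanity-check the CSV output independently of the tool. The sketch below counts how many rows share each combination of quasi-identifier values and prints the smallest group size; for a k-anonymous file that number should be at least k. The helper min_group_size is hypothetical and not part of SecureML. It assumes a header row and no quoted commas, and it compares cell values as plain strings, since the exact form SecureML gives generalized values is not documented here.

```shell
# min_group_size is a hypothetical helper, not part of SecureML.
# Usage: min_group_size FILE.csv COL1,COL2,...
# Prints the size of the smallest group of rows that share the same values
# in the named columns. Assumes a header row and no quoted commas.
min_group_size() {
    awk -F',' -v cols="$2" '
        NR == 1 {
            # Map each requested column name to its position in the header.
            n = split(cols, want, ",")
            for (j = 1; j <= n; j++) {
                for (i = 1; i <= NF; i++) if ($i == want[j]) idx[j] = i
                if (!(j in idx)) {
                    print "missing column: " want[j] > "/dev/stderr"
                    bad = 1
                    exit 2
                }
            }
            next
        }
        {
            # Build a key from the quasi-identifier values and count it.
            key = ""
            for (j = 1; j <= n; j++) key = key SUBSEP $(idx[j])
            count[key]++
        }
        END {
            if (bad) exit 2
            min = -1
            for (k in count)
                if (min < 0 || count[k] < min) min = count[k]
            print (min < 0 ? 0 : min)
        }
    ' "$1"
}
```

For the first example above, `min_group_size anonymized_data.csv age,zipcode` would be expected to print a value of 3 or more if the k=3 guarantee holds.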
Compliance Checking Examples
Verifying compliance with privacy regulations:
# Basic GDPR compliance check
secureml compliance check patient_data.csv \
--regulation GDPR
# Compliance check with metadata and HTML output
secureml compliance check patient_data.csv \
--regulation GDPR \
--metadata metadata.json \
--output gdpr_report.html \
--format html
Example metadata.json file:
{
  "description": "Patient health data",
  "data_owner": "Example Hospital",
  "data_retention_period": "5 years",
  "data_encrypted": true,
  "data_storage_location": "EU",
  "consent_obtained": true,
  "consent_date": "2023-01-15"
}
Checking both dataset and model compliance:
# Comprehensive HIPAA compliance check
secureml compliance check patient_data.csv \
--regulation HIPAA \
--metadata metadata.json \
--model-config model_config.json \
--output hipaa_report.pdf \
--format pdf
Example model_config.json file:
{
  "model_type": "RandomForestClassifier",
  "parameters": {
    "n_estimators": 100,
    "max_depth": 5
  },
  "supports_forget_request": true,
  "supports_deletion_request": true,
  "data_processing_purpose": "Medical diagnosis prediction",
  "model_storage_location": "EU"
}
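Hand-edited JSON files such as metadata.json and model_config.json are easy to break with a stray comma. A quick syntax check before passing them to the CLI might look like the sketch below. It assumes python3 is on your PATH (reasonable, since SecureML is installed through pip) and uses Python's standard json.tool module; validate_json is a hypothetical helper name. It checks syntax only, not whether the keys are ones SecureML expects.

```shell
# validate_json is a hypothetical helper, not part of SecureML.
# It checks JSON syntax only, using Python's standard json.tool module.
validate_json() {
    if python3 -m json.tool "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "invalid JSON: $1" >&2
        return 1
    fi
}

# Example usage (commented out so the sketch runs without the files present):
# validate_json metadata.json && validate_json model_config.json
```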
Synthetic Data Generation Examples
Creating synthetic datasets based on real data:
# Basic statistical synthesis
secureml synthetic generate patient_data.csv synthetic_data.csv \
--method statistical \
--samples 1000
# Auto-detecting sensitive columns
secureml synthetic generate patient_data.csv synthetic_data.csv \
--method statistical \
--auto-detect-sensitive \
--sensitivity-confidence 0.7 \
--sensitivity-sample-size 200 \
--samples 1000
# Using GAN-based synthesis with specific sensitive columns
secureml synthetic generate patient_data.csv synthetic_data.parquet \
--method gan \
--sensitive name --sensitive email --sensitive diagnosis \
--epochs 300 --batch-size 32 \
--samples 500 \
--format parquet
Regulation Presets Examples
Working with regulation presets:
# List all available regulation presets
secureml presets list
# View the GDPR preset
secureml presets show gdpr
# Extract just the personal data identifiers field from GDPR
secureml presets show gdpr --field personal_data_identifiers
# Save the entire HIPAA preset to a file
secureml presets show hipaa --output hipaa_preset.json
Isolated Environment Examples
Managing isolated environments for conflicting dependencies:
# Set up the TensorFlow Privacy environment
secureml environments setup-tf-privacy
# Check if environments are properly configured
secureml environments info
# Force recreation of an environment
secureml environments setup-tf-privacy --force
Key Management Examples
Working with encryption keys (requires HashiCorp Vault):
# Configure Vault connection
secureml keys configure-vault \
--vault-url https://vault.example.com:8200 \
--vault-token hvs.example_token \
--vault-path secureml
# Test Vault connection
secureml keys configure-vault --test-connection
# Generate a new encryption key
secureml keys generate-key \
--key-name patient_data_key \
--length 32 \
--encoding hex
# Retrieve a key
secureml keys get-key \
--key-name patient_data_key \
--encoding base64
Using environment variables to keep Vault credentials off the command line:
# Set environment variables instead of passing tokens directly
export SECUREML_VAULT_URL=https://vault.example.com:8200
export SECUREML_VAULT_TOKEN=hvs.example_token
# The command now uses environment variables automatically
secureml keys get-key --key-name patient_data_key
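Typing the token into an export command still leaves it in your shell history. One option is to read it from a file that only you can access. The sketch below shows that idea; load_vault_token and the token file path are hypothetical, and only the SECUREML_VAULT_TOKEN variable name comes from the example above.

```shell
# load_vault_token is a hypothetical helper, not part of SecureML.
# Usage: load_vault_token TOKEN_FILE
# Exports SECUREML_VAULT_TOKEN from the first line of TOKEN_FILE.
load_vault_token() {
    token_file="$1"
    if [ ! -r "$token_file" ]; then
        echo "error: cannot read token file: $token_file" >&2
        return 1
    fi
    # Only the first line is used; the token never appears on a command line.
    token=$(head -n 1 "$token_file")
    if [ -z "$token" ]; then
        echo "error: token file is empty: $token_file" >&2
        return 1
    fi
    export SECUREML_VAULT_TOKEN="$token"
}

# Example usage (the path is an assumption; restrict the file with chmod 600):
# load_vault_token "$HOME/.secureml_vault_token"
# secureml keys get-key --key-name patient_data_key
```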
End-to-End Example Workflow
A complete workflow for processing sensitive health data:
# 1. Check compliance of the original dataset
secureml compliance check patient_data.csv \
--regulation GDPR \
--output compliance_original.html \
--format html
# 2. Anonymize the dataset for safe processing
secureml anonymization k-anonymize patient_data.csv anonymized_data.csv \
--quasi-id age --quasi-id zipcode \
--sensitive diagnosis --sensitive income \
--k 3
# 3. Check compliance of the anonymized dataset
secureml compliance check anonymized_data.csv \
--regulation GDPR \
--output compliance_anonymized.html \
--format html
# 4. Generate synthetic data for sharing with researchers
secureml synthetic generate anonymized_data.csv synthetic_data.csv \
--method statistical \
--auto-detect-sensitive \
--samples 1000
# 5. Final compliance check on the synthetic data
secureml compliance check synthetic_data.csv \
--regulation GDPR \
--output compliance_synthetic.html \
--format html
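When the five steps above run as a script, a failure in an early step should usually stop the later ones, so that you do not anonymize or share data that failed a check. The sketch below wraps each step in a small helper that reports failures. It assumes that secureml exits with a non-zero status when a command fails, which this document does not confirm; run_step is a hypothetical helper name.

```shell
# run_step is a hypothetical helper, not part of SecureML.
# Usage: run_step LABEL COMMAND [ARGS...]
# Prints the label, runs the command, and returns the command's exit status.
run_step() {
    label="$1"
    shift
    echo "==> $label"
    if "$@"; then
        return 0
    else
        status=$?
        echo "step failed ($status): $label" >&2
        return "$status"
    fi
}

# Example usage (commented out so the sketch runs without SecureML installed).
# Chaining with && stops the workflow at the first failing step:
# run_step "Check original data" \
#     secureml compliance check patient_data.csv --regulation GDPR &&
# run_step "Anonymize" \
#     secureml anonymization k-anonymize patient_data.csv anonymized_data.csv \
#     --quasi-id age --quasi-id zipcode --sensitive diagnosis --k 3
```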
Processing Multiple Files
Example shell script for batch processing:
#!/bin/bash
# Directory containing data files
DATA_DIR="patient_data"
# Create the output directories used below if they do not exist yet
mkdir -p reports anonymized synthetic
# Process each CSV file in the directory
for file in "$DATA_DIR"/*.csv; do
    # Skip the unexpanded pattern when the directory has no CSV files
    [ -e "$file" ] || continue
    filename=$(basename "$file" .csv)
    echo "Processing $filename..."
    # Check compliance
    secureml compliance check "$file" \
        --regulation GDPR \
        --output "reports/${filename}_compliance.html" \
        --format html
    # Anonymize data
    secureml anonymization k-anonymize "$file" \
        "anonymized/${filename}_anon.csv" \
        --quasi-id age --quasi-id zipcode \
        --sensitive diagnosis --sensitive income \
        --k 3
    # Generate synthetic data
    secureml synthetic generate "anonymized/${filename}_anon.csv" \
        "synthetic/${filename}_synth.csv" \
        --method statistical \
        --samples 1000
    echo "$filename completed."
done
echo "All files processed."
Performance Considerations
For large datasets, consider these performance tips:
- Batch processing: Process large files in batches rather than all at once.
- Sample data first: Test your commands on a small sample before processing the entire dataset.
- Choose appropriate output formats: For large datasets, the Parquet format may be more efficient.
- Monitor resources: Some operations (especially GAN-based synthetic data generation) can be resource-intensive.
# Copy the first 1000 lines (the header row plus 999 records) into a sample file
head -n 1000 large_dataset.csv > sample_dataset.csv
# Test your workflow on the sample
secureml synthetic generate sample_dataset.csv synthetic_sample.csv \
--method statistical \
--samples 500
# If satisfied, process the full dataset with parquet output
secureml synthetic generate large_dataset.csv synthetic_full.parquet \
--method statistical \
--samples 10000 \
--format parquet
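The batch-processing tip can be sketched with standard Unix tools: split the data rows into fixed-size pieces and repeat the header row at the top of each piece, so that every chunk is a valid CSV file on its own. The helper split_csv and the chunk file names are hypothetical and not part of SecureML. The sketch assumes no field contains an embedded newline. Keep in mind that anonymizing or synthesizing chunks separately may give different results than processing the whole file at once, so treat this as a starting point.

```shell
# split_csv is a hypothetical helper, not part of SecureML.
# Usage: split_csv INPUT.csv ROWS_PER_CHUNK OUTPUT_DIR
# Writes OUTPUT_DIR/chunk_aa.csv, chunk_ab.csv, and so on, each starting
# with the header row. Assumes no field contains an embedded newline.
split_csv() {
    input="$1"
    rows="$2"
    outdir="$3"
    mkdir -p "$outdir"
    header=$(head -n 1 "$input")
    # Split the data rows (everything after the header) into fixed-size pieces.
    tail -n +2 "$input" | split -l "$rows" - "$outdir/part_"
    for part in "$outdir"/part_*; do
        # Skip the unexpanded pattern when the input had no data rows.
        [ -e "$part" ] || continue
        name=$(basename "$part")
        # Prepend the header so each chunk can be processed independently.
        { printf '%s\n' "$header"; cat "$part"; } > "$outdir/chunk_${name#part_}.csv"
        rm "$part"
    done
}

# Example usage (commented out so the sketch runs without the data file):
# split_csv large_dataset.csv 100000 chunks
```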