Differential Privacy
Differential privacy (DP) provides a mathematical framework for quantifying and limiting the privacy risk when training machine learning models. SecureML builds on established DP libraries, Opacus for PyTorch and TensorFlow Privacy for TensorFlow, so that you can train models with formal privacy guarantees.
Core Concepts
Epsilon (ε): The privacy budget that quantifies the maximum privacy loss. Lower values provide stronger privacy guarantees.
Delta (δ): The probability of information leakage beyond what is allowed by epsilon. This should be very small (typically less than 1/n where n is the dataset size).
Sensitivity: The maximum amount by which a single data point can change the output of a computation. The added noise is scaled to this value.
Noise Mechanisms: Algorithms that add calibrated noise to protect privacy (Laplace, Gaussian, etc.).
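To make these terms concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query, using only the Python standard library. It illustrates the general technique and is not SecureML code: the helper names laplace_noise and private_count are hypothetical. A count has sensitivity 1 (one record changes it by at most 1), so Laplace noise with scale sensitivity/ε yields an ε-differentially private answer.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a zero-mean Laplace distribution with the given scale."""
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a noisy count of the values matching `predicate` (hypothetical helper)."""
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1.0  # adding or removing one record changes a count by at most 1
    scale = sensitivity / epsilon  # smaller epsilon means more noise
    return true_count + laplace_noise(scale, rng)


rng = random.Random(0)  # seeded only to make the illustration repeatable
ages = [23, 35, 41, 52, 29, 64, 38]
noisy = private_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng)
print(f"noisy count of people aged 40 or older: {noisy:.2f}")
```

Halving epsilon doubles the noise scale, which is the privacy and accuracy trade-off described above in its simplest form.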
Basic Usage
Training a Differentially Private Model
The simplest way to train a model with differential privacy is to use the differentially_private_train function:
from secureml.privacy import differentially_private_train
import tensorflow as tf # or import torch for PyTorch
# Create a model (TensorFlow example)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
# Train the model with differential privacy
dp_model = differentially_private_train(
    model=model,
    data=training_data,  # DataFrame or numpy array
    epsilon=1.0,  # Privacy budget
    delta=1e-5,  # Probability of privacy breach
    max_grad_norm=1.0,  # Maximum gradient norm for clipping
    noise_multiplier=None,  # If None, calculated from epsilon and delta
    batch_size=64,
    epochs=10
)
# Make predictions with the differentially private model
predictions = dp_model.predict(X_test)
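The max_grad_norm and noise_multiplier parameters correspond to the two steps of the standard DP-SGD recipe: clip each example's gradient to a fixed L2 norm, then add Gaussian noise whose standard deviation is noise_multiplier times that norm. The sketch below shows one such step in plain Python. It is a simplified illustration under that assumption, not the code that SecureML, Opacus, or TensorFlow Privacy actually runs, and the function names are hypothetical.

```python
import math
import random


def clip_gradient(grad, max_grad_norm):
    """Scale one per-example gradient so its L2 norm is at most max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_grad_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


def noisy_batch_gradient(per_example_grads, max_grad_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum them, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, max_grad_norm) for g in per_example_grads]
    dim = len(per_example_grads[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_grad_norm  # noise std is tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]


rng = random.Random(0)  # seeded only to make the illustration repeatable
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]  # toy per-example gradients
update = noisy_batch_gradient(grads, max_grad_norm=1.0, noise_multiplier=1.1, rng=rng)
print(update)
```

Clipping bounds the sensitivity of the summed gradient to max_grad_norm, which is what allows the noise to be calibrated at all.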
SecureML automatically detects whether you’re using PyTorch or TensorFlow based on the model you provide. You can also explicitly specify the framework:
# Specify the framework explicitly
dp_model = differentially_private_train(
    model=model,
    data=training_data,
    epsilon=1.0,
    delta=1e-5,
    framework="tensorflow"  # or "pytorch"
)
Supported Frameworks
SecureML supports differential privacy for multiple ML frameworks:
PyTorch Integration with Opacus
For PyTorch models, SecureML uses the Opacus library under the hood:
import torch
import torch.nn as nn
from secureml.privacy import differentially_private_train
# Define a PyTorch model (input_size and output_size are placeholders
# for your number of features and number of classes)
model = nn.Sequential(
    nn.Linear(input_size, 128),
    nn.ReLU(),
    nn.Linear(128, output_size)
)
# Train with differential privacy
dp_model = differentially_private_train(
    model=model,
    data=training_data,
    epsilon=1.0,
    delta=1e-5,
    batch_size=64,
    epochs=10,
    criterion=torch.nn.CrossEntropyLoss(),
    learning_rate=0.001,
    validation_split=0.2
)
TensorFlow Integration with TensorFlow Privacy
For TensorFlow models, SecureML uses TensorFlow Privacy in an isolated environment:
import tensorflow as tf
from secureml.privacy import differentially_private_train
# Create a Keras model (input_size and output_size are placeholders
# for your number of features and number of classes)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(input_size,)),
    tf.keras.layers.Dense(output_size, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train with differential privacy
dp_model = differentially_private_train(
    model=model,
    data=training_data,
    epsilon=1.0,
    delta=1e-5,
    batch_size=64,
    epochs=10,
    early_stopping_patience=3
)
TensorFlow Privacy and Isolated Environments
TensorFlow Privacy can have dependencies that conflict with your main environment, so SecureML runs it in a separate, isolated environment. Setup of that environment and communication with it are managed automatically.
What Happens Behind the Scenes
When you specify framework="tensorflow" in the differentially_private_train function:
SecureML checks if a TensorFlow Privacy isolated environment exists
If not, it creates one automatically (there may be a delay during this first-time setup)
Your model and data are serialized and sent to the isolated environment
Training happens in the isolated environment
The trained model is returned to your main environment
from secureml.privacy import differentially_private_train
import tensorflow as tf
# Create a model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Train with differential privacy using TensorFlow Privacy
# This automatically uses the isolated environment
private_model = differentially_private_train(
    model=model,
    data=training_data,
    epsilon=1.0,
    delta=1e-5,
    epochs=10,
    batch_size=32,
    framework="tensorflow"  # This triggers the isolated environment
)
# Use the model as normal
predictions = private_model.predict(test_data)
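The serialize, train, and return steps listed in this section follow a common pattern for running work in a separate interpreter. The sketch below imitates that pattern with the standard library only: a job is serialized to JSON, handed to a child Python process, and the reply is read back. It is an assumption-laden illustration, not SecureML's implementation. The worker script, the job fields, and the function run_in_separate_interpreter are all hypothetical, and the child here reuses the current interpreter where a real setup would point at the isolated environment's Python.

```python
import json
import subprocess
import sys

# Stand-in for the training script that would live in the isolated environment.
WORKER_SOURCE = """
import json, sys
job = json.load(sys.stdin)              # receive the serialized job
result = {"epochs_run": job["epochs"], "status": "trained"}
json.dump(result, sys.stdout)           # send the result back
"""


def run_in_separate_interpreter(job: dict, python_executable: str = sys.executable) -> dict:
    """Send a JSON job to a worker process and return its JSON reply."""
    completed = subprocess.run(
        [python_executable, "-c", WORKER_SOURCE],
        input=json.dumps(job),   # serialized job goes to the worker's stdin
        capture_output=True,
        text=True,
        check=True,              # raise if the worker exits with an error
    )
    return json.loads(completed.stdout)


reply = run_in_separate_interpreter({"epochs": 10, "epsilon": 1.0, "delta": 1e-5})
print(reply)
```

Because the two processes share nothing but serialized data, the worker can import packages that would conflict with those in the calling environment.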
Pre-setup for Faster Execution
To avoid delays during your first run, you can set up the TensorFlow Privacy environment in advance:
secureml environments setup-tf-privacy
For more detailed information on how SecureML manages isolated environments, see the Isolated Environments section.
Advanced Training Options
Both PyTorch and TensorFlow integrations support additional training parameters:
# Common parameters for both frameworks
dp_model = differentially_private_train(
    model=model,
    data=training_data,
    epsilon=1.0,
    delta=1e-5,
    batch_size=64,
    epochs=10,
    learning_rate=0.001,
    validation_split=0.2,
    shuffle=True,
    verbose=True,
    early_stopping_patience=5  # Stop training if validation loss doesn't improve
)
Data Preparation
The differentially_private_train function can handle both DataFrames and NumPy arrays:
import numpy as np
# Using a DataFrame
dp_model = differentially_private_train(
    model=model,
    data=df,  # DataFrame where the last column is the target by default
    target_column="label",  # Specify a different target column if needed
    epsilon=1.0,
    delta=1e-5
)
# Using NumPy arrays
dp_model = differentially_private_train(
    model=model,
    data=np.concatenate([X, y.reshape(-1, 1)], axis=1),  # Concatenate features and labels
    epsilon=1.0,
    delta=1e-5
)
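The "last column is the target" convention can be pictured with a small helper that splits combined rows back into features and labels. This is only a sketch of the data layout in plain Python. The function split_features_and_target is hypothetical and is not part of SecureML, which performs its own splitting internally.

```python
def split_features_and_target(rows, target_index=-1):
    """Split each row into (features, target); the target defaults to the last column."""
    features, targets = [], []
    for row in rows:
        row = list(row)                          # copy so the input is not modified
        targets.append(row.pop(target_index))    # remove and keep the target value
        features.append(row)                     # what remains are the features
    return features, targets


# Two feature columns followed by a binary label column
rows = [
    [0.5, 1.2, 0],
    [0.1, 0.7, 1],
    [0.9, 0.3, 1],
]
X_rows, y_values = split_features_and_target(rows)
print(X_rows)
print(y_values)
```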
Monitoring Privacy Budget
Both frameworks provide information about the actual privacy budget spent during training. This is displayed in the output if verbose=True:
# Train with differential privacy
dp_model = differentially_private_train(
    model=model,
    data=training_data,
    epsilon=1.0,
    delta=1e-5,
    verbose=True  # Will show privacy budget spent after training
)
In PyTorch (Opacus), you can also manually query the spent privacy budget:
from opacus import PrivacyEngine
# get_epsilon reports the budget consumed by the engine that was attached to training
privacy_engine = PrivacyEngine()
# Attach the engine with privacy_engine.make_private(...), then run your training loop...
# Get the privacy budget spent
spent_epsilon = privacy_engine.get_epsilon(delta=1e-5)
print(f"Privacy budget spent (ε = {spent_epsilon:.4f})")
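The reported epsilon covers a single training run. If you train several times on the same data, the costs accumulate. The sketch below tracks that accumulation with basic sequential composition, where the epsilons and deltas of successive releases simply add up. This is a deliberately simple and pessimistic bound shown for illustration. The BudgetLedger class is hypothetical and not part of SecureML or Opacus, and the accountants used inside DP training libraries typically give tighter estimates.

```python
from dataclasses import dataclass


@dataclass
class BudgetLedger:
    """Track cumulative privacy spending under basic sequential composition."""

    epsilon_limit: float
    delta_limit: float
    epsilon_spent: float = 0.0
    delta_spent: float = 0.0

    def charge(self, epsilon: float, delta: float) -> None:
        """Record one (epsilon, delta)-DP release, refusing to exceed the limits."""
        if (self.epsilon_spent + epsilon > self.epsilon_limit
                or self.delta_spent + delta > self.delta_limit):
            raise ValueError("privacy budget exceeded")
        self.epsilon_spent += epsilon
        self.delta_spent += delta


ledger = BudgetLedger(epsilon_limit=3.0, delta_limit=1e-5)
ledger.charge(epsilon=1.0, delta=2e-6)  # first training run
ledger.charge(epsilon=1.0, delta=2e-6)  # second run on the same data
print(f"spent so far: epsilon={ledger.epsilon_spent}, delta={ledger.delta_spent}")
```

A ledger like this is most useful when several experiments share one dataset and an overall budget has been agreed in advance.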
Integration with Federated Learning
SecureML supports combining differential privacy with federated learning:
from secureml.federated import start_federated_client
# Start a federated learning client with differential privacy
start_federated_client(
    model=model,
    data=client_data,
    server_address="localhost:8080",
    apply_differential_privacy=True,
    epsilon=1.0,
    delta=1e-5,
    max_grad_norm=1.0
)
Best Practices
Start with a higher epsilon: Begin with a higher privacy budget (e.g., ε=10) and gradually reduce it to find the right balance between privacy and model accuracy
Use larger batch sizes: Larger batches improve the signal-to-noise ratio of each update, because noise is added once per batch while the gradient is averaged over more examples
Pre-train on public data: Initialize models with public data before fine-tuning with differential privacy on sensitive data
Simplify models: Models with fewer parameters are often less affected by the added noise, so they can reach a given accuracy with a smaller privacy budget
Monitor training curves: Watch for signs of excessive noise affecting convergence
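Manually set noise_multiplier: If the auto-calculated noise is too high for the model to converge, try manually setting a lower value. Less noise means a weaker guarantee for the same amount of training, so check the privacy budget actually spent afterwards
Tune the clipping threshold: Experiment with max_grad_norm for your specific problem. Too low a value distorts the gradients, while too high a value increases the noise added, since the noise scales with the threshold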
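Two of these practices can be made concrete with a little arithmetic. The sketch below assumes the standard DP-SGD setup in which Gaussian noise with standard deviation noise_multiplier × max_grad_norm is added to the summed gradients before averaging. It shows how the noise in the averaged gradient shrinks with batch size for a fixed noise_multiplier, and it uses the median of observed per-example gradient norms as a starting point for the clipping threshold, a heuristic suggested in the DP-SGD literature. Both function names are hypothetical and neither is a SecureML API.

```python
import math
import statistics


def averaged_noise_std(noise_multiplier, max_grad_norm, batch_size):
    """Noise std per coordinate of the averaged gradient, for a fixed noise_multiplier."""
    return noise_multiplier * max_grad_norm / batch_size


def suggest_clipping_threshold(per_example_grads):
    """Median L2 norm of the observed per-example gradients (a starting point only)."""
    norms = [math.sqrt(sum(g * g for g in grad)) for grad in per_example_grads]
    return statistics.median(norms)


for batch_size in (64, 256):
    std = averaged_noise_std(noise_multiplier=1.0, max_grad_norm=1.0, batch_size=batch_size)
    print(f"batch_size={batch_size}: noise std per coordinate = {std:.5f}")

# Toy gradients with L2 norms 5, 0.5, 10 and 1
sample_grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0], [1.0, 0.0]]
print("suggested max_grad_norm:", suggest_clipping_threshold(sample_grads))
```

Note that in practice the gradient norms are themselves computed from private data, so a threshold chosen this way is best estimated on public or held-out data.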
Further Reading
Differential Privacy API - Complete API reference for differential privacy functions
Differential Privacy Examples - More examples of differential privacy techniques
The Algorithmic Foundations of Differential Privacy - Foundational monograph by Dwork and Roth