Compliance
Compliance checking is a core part of privacy-preserving machine learning. SecureML provides tools to check whether your datasets and models meet the requirements of privacy regulations such as GDPR, CCPA, HIPAA, and LGPD.
Core Concepts
Compliance Report: A structured report containing compliance check results, including issues, warnings, and passed checks.
Supported Regulations: SecureML supports checks against major privacy regulations:
GDPR: General Data Protection Regulation (European Union)
CCPA: California Consumer Privacy Act
HIPAA: Health Insurance Portability and Accountability Act
LGPD: General Data Protection Law (Lei Geral de Proteção de Dados, Brazil)
Compliance Levels:
Dataset-level compliance: Detecting personal data and protected health information (PHI) in datasets
Model-level compliance: Verifying that models support privacy requirements
Pipeline-level compliance: Checking the entire machine learning pipeline
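To make the report concept concrete, the following is a minimal sketch of a structure that holds issues, warnings, and passed checks. The class name SketchComplianceReport and its fields are hypothetical and are not SecureML's actual implementation; only the has_issues() method name and the issue keys (severity, issue, recommendation) are taken from the usage examples in this guide.

```python
from dataclasses import dataclass, field


@dataclass
class SketchComplianceReport:
    """Hypothetical stand-in for a compliance report; not SecureML's class."""

    regulation: str
    issues: list = field(default_factory=list)
    warnings: list = field(default_factory=list)
    passed_checks: list = field(default_factory=list)

    def add_issue(self, component, issue, severity, recommendation):
        # Store each issue as a plain dict so callers can read it by key.
        self.issues.append({
            "component": component,
            "issue": issue,
            "severity": severity,
            "recommendation": recommendation,
        })

    def has_issues(self):
        # A report is considered clean only when no issues were recorded.
        return len(self.issues) > 0


report_sketch = SketchComplianceReport(regulation="GDPR")
report_sketch.passed_checks.append("data_encrypted")
report_sketch.add_issue(
    component="dataset",
    issue="Column 'email' looks like personal data",
    severity="high",  # severity labels are assumed for illustration
    recommendation="Anonymize or drop the column",
)
print(report_sketch.has_issues())
```

Keeping issues, warnings, and passed checks in separate lists lets a caller decide which of the three should block a release.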
Basic Usage
Basic Compliance Check
The simplest way to check compliance is to call the check_compliance function:
from secureml.compliance import check_compliance
import pandas as pd
# Sample dataset
data = pd.DataFrame({
    'name': ['Alice Smith', 'Bob Johnson'],
    'age': [30, 45],
    'email': ['alice@example.com', 'bob@example.com'],
    'diagnosis': ['Flu', 'Diabetes']
})
# Check compliance with GDPR
report = check_compliance(
    data=data,
    regulation="GDPR",
    max_samples=100  # Maximum number of samples to analyze
)
# Print the report
print(report)
# Check compliance status
if report.has_issues():
    print("Compliance issues found!")
    for issue in report.issues:
        print(f"- {issue['severity']}: {issue['issue']}")
        print(f" Recommendation: {issue['recommendation']}")
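When a report contains many issues, it can help to group them by severity before printing. The sketch below works on plain dictionaries with the same keys read in the loop above; the sample records, the severity labels ("high", "medium", "low"), and the helper name group_by_severity are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical issue records; only the key names come from the example above.
sample_issues = [
    {"severity": "high", "issue": "Email addresses present",
     "recommendation": "Pseudonymize emails"},
    {"severity": "low", "issue": "No retention period set",
     "recommendation": "Add retention metadata"},
    {"severity": "high", "issue": "Health data present",
     "recommendation": "Restrict access"},
]


def group_by_severity(issues):
    """Return a dict mapping each severity label to its list of issues."""
    grouped = defaultdict(list)
    for item in issues:
        grouped[item["severity"]].append(item)
    return dict(grouped)


grouped = group_by_severity(sample_issues)

# Print the most severe issues first; labels with no issues are skipped.
for severity in ("high", "medium", "low"):
    for item in grouped.get(severity, []):
        print(f"[{severity}] {item['issue']} -> {item['recommendation']}")
```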
Using ComplianceAuditor
For more comprehensive audits, use the ComplianceAuditor class:
from secureml.compliance import ComplianceAuditor
# Create an auditor for GDPR
auditor = ComplianceAuditor(
    regulation="GDPR",
    log_dir="audit_logs"  # Optional: directory to store audit logs
)
Dataset Audit
Audit a dataset for compliance:
# Audit a dataset with metadata
dataset_report = auditor.audit_dataset(
    dataset=data,
    dataset_name="patient_records",
    metadata={
        "description": "Patient medical records",
        "data_owner": "Hospital A",
        "data_retention_period": "5 years",
        "data_encrypted": True,
        "data_storage_location": "EU"
    }
)
# Print the report
print(dataset_report)
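Before running an audit, it can be useful to confirm that the metadata dictionary is complete. The sketch below lists any missing keys; the key names come from the example above, but treating all of them as required, and the helper name missing_metadata_keys, are assumptions of this sketch rather than documented SecureML behavior.

```python
# Key names taken from the dataset audit example; treating them all as
# required is an assumption made for this sketch.
EXPECTED_METADATA_KEYS = (
    "description",
    "data_owner",
    "data_retention_period",
    "data_encrypted",
    "data_storage_location",
)


def missing_metadata_keys(metadata, expected=EXPECTED_METADATA_KEYS):
    """Return the expected keys that are absent, in their declared order."""
    return [key for key in expected if key not in metadata]


partial_metadata = {
    "description": "Patient medical records",
    "data_owner": "Hospital A",
    "data_encrypted": True,
}

missing = missing_metadata_keys(partial_metadata)
if missing:
    print(f"Metadata is missing: {', '.join(missing)}")
```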
Model Audit
Audit a model for compliance:
# Model configuration
model_config = {
    "model_type": "RandomForestClassifier",
    "parameters": {
        "n_estimators": 100,
        "max_depth": 5
    },
    "supports_forget_request": True,  # Supports GDPR right to be forgotten
    "data_processing_purpose": "Medical diagnosis prediction"
}
# Audit the model
model_report = auditor.audit_model(
    model_config=model_config,
    model_name="diagnosis_predictor",
    model_documentation={
        "version": "1.0",
        "training_date": "2024-01-01",
        "training_data_description": "Patient records from 2023"
    }
)
print(model_report)
Full Pipeline Audit
Audit an entire ML pipeline including preprocessing steps:
# Define preprocessing steps
preprocessing_steps = [
    {
        "name": "data_cleaning",
        "type": "anonymization",
        "input": "raw_data",
        "output": "anonymized_data",
        "parameters": {
            "method": "k-anonymity",
            "k": 2,
            "sensitive_columns": ["name", "email", "phone"]
        }
    },
    {
        "name": "feature_selection",
        "type": "minimization",
        "input": "anonymized_data",
        "output": "minimized_data",
        "parameters": {
            "selected_features": ["age", "diagnosis", "income"]
        }
    }
]
# Audit the entire pipeline
pipeline_report = auditor.audit_pipeline(
    dataset=data,
    dataset_name="patient_records",
    model=model_config,
    model_name="diagnosis_predictor",
    preprocessing_steps=preprocessing_steps,
    metadata={
        "pipeline_version": "1.0",
        "last_updated": "2024-01-01",
        "data_owner": "Hospital A",
        "data_encrypted": True
    }
)
# The pipeline audit returns a dictionary with individual component reports
for component, report in pipeline_report.items():
    print(f"\n{component.upper()} Report:")
    print(report)
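Because each preprocessing step names its input and output, a quick consistency check can catch a mis-wired pipeline before it is audited. The sketch below verifies that every step consumes the output of the step before it. The helper name find_broken_links is hypothetical, and whether SecureML performs a similar check internally is not stated in this guide.

```python
def find_broken_links(steps):
    """Return (previous_name, current_name) pairs whose output/input differ."""
    broken = []
    # Pair each step with its successor and compare output to input.
    for previous, current in zip(steps, steps[1:]):
        if previous["output"] != current["input"]:
            broken.append((previous["name"], current["name"]))
    return broken


steps_ok = [
    {"name": "data_cleaning", "input": "raw_data",
     "output": "anonymized_data"},
    {"name": "feature_selection", "input": "anonymized_data",
     "output": "minimized_data"},
]

# Same steps, but the second one reads the raw data and so bypasses
# the anonymization step.
steps_broken = [
    {"name": "data_cleaning", "input": "raw_data",
     "output": "anonymized_data"},
    {"name": "feature_selection", "input": "raw_data",
     "output": "minimized_data"},
]

print(find_broken_links(steps_ok))
print(find_broken_links(steps_broken))
```

A broken link is worth flagging in a privacy context because a later step that reads the raw data silently skips the anonymization applied earlier.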
Generating PDF Reports
Generate a detailed PDF report of the compliance audit:
# Generate PDF report from pipeline audit
pdf_path = auditor.generate_pdf(
    audit_result=pipeline_report,
    output_file="compliance_report.pdf",
    title="Patient Records Pipeline Compliance Audit",
    logo_path="company_logo.png"  # Optional
)
How Compliance Checks Work
Identifying Sensitive Data
SecureML uses several approaches to identify sensitive data:
Column name analysis: Checks column names against known patterns of sensitive data
Content analysis: Uses NLP techniques to identify patterns in text data
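Automated detection: The _identify_sensitive_columns function can automatically detect potentially sensitive columns. The leading underscore marks it as an internal helper by Python convention, so treat its signature as subject to change.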
from secureml.anonymization import _identify_sensitive_columns
# Automatically identify sensitive columns
sensitive_cols = _identify_sensitive_columns(data)
print(f"Automatically identified sensitive columns: {sensitive_cols}")
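The first two approaches can be illustrated with a small, self-contained sketch that combines a column-name check with a simple content check. This is not how SecureML implements detection: the keyword list, the email regular expression, and the helper name flag_sensitive_columns are all assumptions chosen for illustration, and a regular expression stands in for the NLP-based content analysis.

```python
import re

# Assumed keyword list; a real deployment would use a much larger set.
NAME_PATTERN = re.compile(
    r"name|email|phone|address|ssn|birth|diagnosis", re.IGNORECASE
)
# Deliberately loose email shape: something@something.something
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def flag_sensitive_columns(columns):
    """Map each flagged column name to the reasons it was flagged.

    `columns` is a dict of column name -> list of values.
    """
    flagged = {}
    for name, values in columns.items():
        reasons = []
        if NAME_PATTERN.search(name):
            reasons.append("column name")
        if any(isinstance(v, str) and EMAIL_PATTERN.match(v) for v in values):
            reasons.append("email-like values")
        if reasons:
            flagged[name] = reasons
    return flagged


sample_columns = {
    "name": ["Alice Smith", "Bob Johnson"],
    "age": [30, 45],
    "contact": ["alice@example.com", "bob@example.com"],
    "diagnosis": ["Flu", "Diabetes"],
}

flagged = flag_sensitive_columns(sample_columns)
for column, reasons in flagged.items():
    print(f"{column}: {', '.join(reasons)}")
```

The "contact" column shows why content analysis matters: its name gives nothing away, but its values do. Substring matching on names also over-flags (for example, "filename"), so results like these are best treated as candidates for review.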
Regulation-Specific Checks
Each regulation has specific checks based on its requirements:
GDPR Checks:
- Personal data identification
- Special category data identification
- Data minimization
- Explicit consent
- Right to be forgotten capability
- Cross-border data transfer
CCPA Checks:
- Personal information identification
- California residents’ data handling
- Sale of personal information
- Deletion capability
HIPAA Checks:
- Protected Health Information (PHI) identification
- De-identification method verification
- Data security and encryption
LGPD Checks:
- Personal data identification
- Sensitive data identification
- Data minimization
- Explicit consent
- Right to be forgotten capability
- Cross-border data transfer
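One way to picture the per-regulation checks is as a lookup table keyed by regulation name. The sketch below encodes the lists above in a plain dictionary and compares two regulations. The names REGULATION_CHECKS and checks_for are hypothetical; SecureML's internal representation may differ.

```python
# Check names copied from the lists above; the dict layout is an assumption.
REGULATION_CHECKS = {
    "GDPR": [
        "Personal data identification",
        "Special category data identification",
        "Data minimization",
        "Explicit consent",
        "Right to be forgotten capability",
        "Cross-border data transfer",
    ],
    "CCPA": [
        "Personal information identification",
        "California residents' data handling",
        "Sale of personal information",
        "Deletion capability",
    ],
    "HIPAA": [
        "Protected Health Information (PHI) identification",
        "De-identification method verification",
        "Data security and encryption",
    ],
    "LGPD": [
        "Personal data identification",
        "Sensitive data identification",
        "Data minimization",
        "Explicit consent",
        "Right to be forgotten capability",
        "Cross-border data transfer",
    ],
}


def checks_for(regulation):
    """Look up checks case-insensitively; reject unknown regulations."""
    key = regulation.upper()
    if key not in REGULATION_CHECKS:
        raise ValueError(f"Unsupported regulation: {regulation}")
    return REGULATION_CHECKS[key]


# Checks that GDPR and LGPD have in common, by name.
shared = sorted(set(checks_for("gdpr")) & set(checks_for("lgpd")))
print(shared)
```

The overlap between GDPR and LGPD suggests that remediation work done for one often carries over to the other, although the legal requirements behind identically named checks can still differ.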
Regulation Presets
SecureML uses presets for each regulation stored in YAML files. You can access preset information programmatically:
from secureml.presets import list_available_presets, load_preset, get_preset_field
# List available regulation presets
available_presets = list_available_presets()
print(f"Available regulations: {available_presets}")
# Load a specific preset
gdpr_preset = load_preset("gdpr")
# Get specific field from a preset
personal_data_identifiers = get_preset_field("gdpr", "personal_data_identifiers")
special_categories = get_preset_field("gdpr", "special_categories")
print(f"GDPR personal data identifiers: {personal_data_identifiers}")
Best Practices
Start early: Build compliance into your ML workflows from the beginning, not as an afterthought
Be comprehensive: Check compliance across all phases of the ML lifecycle, from data collection to model deployment
Document everything: Maintain detailed records of compliance checks and actions taken to address issues
Add appropriate metadata: Include information about data sources, consent, processing purpose, etc.
Regular audits: Schedule regular compliance audits of your ML systems
Integrate with audit trails: Use audit trails to document compliance activities
Remediate issues: Address identified compliance issues promptly
Stay updated: Keep abreast of changes in regulations that may affect compliance requirements
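As an illustration of the documentation and audit-trail practices above, the sketch below appends one JSON record per compliance check to a log stream. It is a generic pattern, not SecureML's audit-trail API; the function name append_audit_record, the record fields, and the status labels are assumptions.

```python
import io
import json
from datetime import datetime, timezone


def append_audit_record(stream, regulation, component, issue_count):
    """Write one JSON line describing a compliance check and return it."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "regulation": regulation,
        "component": component,
        "issue_count": issue_count,
        "status": "pass" if issue_count == 0 else "needs_remediation",
    }
    stream.write(json.dumps(record) + "\n")
    return record


# StringIO stands in for a file opened in append mode, e.g. open(path, "a").
log = io.StringIO()
append_audit_record(log, "GDPR", "dataset", 2)
append_audit_record(log, "GDPR", "model", 0)

# Reading the trail back: one JSON object per line.
records = [json.loads(line) for line in log.getvalue().splitlines()]
for record in records:
    print(record["component"], record["status"])
```

An append-only, line-oriented log keeps each check independently readable, which makes it straightforward to show later when an issue was found and when it was resolved.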
Further Reading
Compliance API - Complete API reference for compliance functions
Compliance Checking Examples - More examples of compliance checking techniques
/regulations/gdpr - Detailed guide on GDPR compliance
/regulations/ccpa - Detailed guide on CCPA compliance
/regulations/hipaa - Detailed guide on HIPAA compliance