Compliance API
This module provides tools to verify that datasets and models comply with privacy regulations such as GDPR, CCPA, and HIPAA.
Main Functions
- secureml.compliance.check_compliance(data: DataFrame | Dict[str, Any], model_config: Dict[str, Any] | None = None, regulation: str = 'GDPR', max_samples: int = 100, **kwargs: Any) -> ComplianceReport
Check a dataset and model configuration for compliance with privacy regulations.
Uses NLP techniques to analyze dataset content for sensitive information.
- Args:
data: The dataset to check (DataFrame or dict with metadata)
model_config: Optional configuration of the model to check
regulation: The regulation to check compliance against ('GDPR', 'CCPA', 'HIPAA', 'LGPD')
max_samples: Maximum number of samples to analyze for content
**kwargs: Additional parameters for specific compliance checks
- Returns:
A ComplianceReport with the results of the compliance check
- Raises:
ValueError: If an unsupported regulation is specified
This is the main function for checking compliance with privacy regulations:
from secureml.compliance import check_compliance

# Check a dataset for GDPR compliance
# (my_dataframe is assumed to be an existing pandas DataFrame)
report = check_compliance(
    data=my_dataframe,
    regulation="GDPR",
)

# Check if any issues were found
if report.has_issues():
    print(report)
Compliance Reports
- class secureml.compliance.ComplianceReport(regulation: str)
A report containing the results of a compliance check.
- __init__(regulation: str)
Initialize a compliance report.
- Args:
regulation: The regulation the check was performed against
- add_issue(component: str, issue: str, severity: str, recommendation: str) -> None
Add an issue to the report.
- Args:
component: The component where the issue was found
issue: Description of the issue
severity: Severity level ('high', 'medium', 'low')
recommendation: Recommended action to resolve the issue
- add_passed_check(check_name: str) -> None
Add a passed check to the report.
- Args:
check_name: Name of the check that passed
- add_warning(component: str, warning: str, recommendation: str) -> None
Add a warning to the report.
- Args:
component: The component where the warning was triggered
warning: Description of the warning
recommendation: Recommended action to address the warning
- generate_report(output_file: str, format: str = 'html', logo_path: str | None = None, include_charts: bool = True) -> str
Generate a report from the compliance check results.
- Args:
output_file: Path to write the report to
format: Report format ('html' or 'pdf')
logo_path: Path to a logo image
include_charts: Whether to include charts in the report
- Returns:
Path to the generated report
- has_issues() -> bool
Check if the report contains any issues.
- Returns:
True if the report contains issues, False otherwise
- has_warnings() -> bool
Check if the report contains any warnings.
- Returns:
True if the report contains warnings, False otherwise
- summary() -> Dict[str, Any]
Get a summary of the compliance report.
- Returns:
A dictionary containing the summary information
The ComplianceReport class contains the results of a compliance check and provides methods for accessing and displaying those results:
# Access report summary
summary = report.summary()

# Get detailed information about each issue
if report.has_issues():
    for issue in report.issues:
        print(f"Issue: {issue['issue']}")
        print(f"Severity: {issue['severity']}")
        print(f"Recommendation: {issue['recommendation']}")
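When a report contains many issues, it can help to triage them by severity before reading the details. The following is a minimal sketch, assuming `report.issues` is a list of dictionaries with the `issue`, `severity`, and `recommendation` keys used in the example above. The helper `group_by_severity` and the sample data are hypothetical and not part of secureml.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Severity levels documented for add_issue(), most urgent first.
SEVERITY_ORDER = ("high", "medium", "low")


def group_by_severity(issues: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group issue dicts by severity, ordered high -> medium -> low.

    Hypothetical helper (not part of secureml). Assumes each issue is a
    dict that normally carries a 'severity' key.
    """
    buckets: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for issue in issues:
        buckets[issue.get("severity", "unknown")].append(issue)
    # Known levels first, then any unexpected levels in alphabetical order.
    ordered = [s for s in SEVERITY_ORDER if s in buckets]
    ordered += sorted(s for s in buckets if s not in SEVERITY_ORDER)
    return {severity: buckets[severity] for severity in ordered}


# Stand-in for report.issues; in real use, pass report.issues instead.
sample_issues = [
    {"issue": "No consent metadata", "severity": "medium",
     "recommendation": "Record consent status per data subject"},
    {"issue": "Email column stored in plain text", "severity": "high",
     "recommendation": "Pseudonymize or drop the column"},
]

for severity, items in group_by_severity(sample_issues).items():
    print(f"{severity}: {len(items)} issue(s)")
```

Returning a plain dictionary keeps the result easy to serialize or to pass on to a custom report template.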
Compliance Auditor
- class secureml.compliance.ComplianceAuditor(regulation: str = 'GDPR', log_dir: str | None = None)
A class for auditing ML pipelines for compliance with privacy regulations.
The ComplianceAuditor provides a higher-level interface for conducting compliance audits of ML pipelines, generating comprehensive audit trails, and producing detailed reports.
- __init__(regulation: str = 'GDPR', log_dir: str | None = None)
Initialize a compliance auditor.
- Args:
regulation: The regulation to audit against
log_dir: Directory to store audit logs
- audit_dataset(dataset: DataFrame | Dict[str, Any], dataset_name: str, metadata: Dict[str, Any] | None = None, max_samples: int = 100) -> ComplianceReport
Audit a dataset for compliance.
- Args:
dataset: The dataset to audit
dataset_name: Name of the dataset
metadata: Additional metadata about the dataset
max_samples: Maximum number of samples to analyze for content
- Returns:
A compliance report for the dataset
- audit_model(model_config: Dict[str, Any], model_name: str, model_documentation: Dict[str, Any] | None = None) -> ComplianceReport
Audit a model for compliance.
- Args:
model_config: Configuration of the model
model_name: Name of the model
model_documentation: Additional documentation about the model
- Returns:
A compliance report for the model
- audit_pipeline(dataset: DataFrame | Dict[str, Any] | None = None, dataset_name: str | None = None, model: Dict[str, Any] | None = None, model_name: str | None = None, preprocessing_steps: List[Dict[str, Any]] | None = None, metadata: Dict[str, Any] | None = None) -> Dict[str, Any]
Audit an entire ML pipeline for compliance.
- Args:
dataset: The dataset used in the pipeline
dataset_name: Name of the dataset
model: Model configuration or object
model_name: Name of the model
preprocessing_steps: List of preprocessing steps
metadata: Additional metadata about the pipeline
- Returns:
Dictionary containing compliance reports for each component
- generate_pdf(audit_result: Dict[str, Any], output_file: str, title: str | None = None, logo_path: str | None = None) -> str
Generate a PDF report from an audit result.
- Args:
audit_result: The result of an audit_pipeline call
output_file: Path to write the PDF report to
title: Title for the report
logo_path: Path to a logo image
- Returns:
Path to the generated PDF
The following example audits a dataset, a model, and a complete ML pipeline, then writes the pipeline results to a PDF report:
from secureml.compliance import ComplianceAuditor

# Create an auditor for GDPR compliance
auditor = ComplianceAuditor(regulation="GDPR")

# Audit a dataset
dataset_report = auditor.audit_dataset(
    dataset=my_dataframe,
    dataset_name="customer_data",
)

# Audit a model
model_report = auditor.audit_model(
    model_config=model_params,
    model_name="credit_scoring_model",
)

# Audit an entire ML pipeline
pipeline_report = auditor.audit_pipeline(
    dataset=my_dataframe,
    dataset_name="customer_data",
    model=my_model,
    model_name="credit_scoring_model",
    preprocessing_steps=preprocessing_config,
)

# Generate a PDF report
auditor.generate_pdf(
    pipeline_report,
    output_file="compliance_report.pdf",
    title="ML Pipeline Compliance Audit",
)
Data Identification Functions
- secureml.compliance.identify_personal_data(data: DataFrame, max_samples: int = 100, personal_identifiers: List[str] | None = None, sensitive_categories: List[str] | None = None) -> Dict[str, Any]
Identify personal data in a dataset.
- Args:
data: The dataset to analyze
max_samples: Maximum number of samples to analyze for content
personal_identifiers: List of personal data identifiers to check for
sensitive_categories: List of sensitive data categories to check for
- Returns:
Dictionary with information about identified personal data
The following example runs identify_personal_data on a DataFrame and reads both the column-level and the content-level findings:
from secureml.compliance import identify_personal_data

# Identify personal data in a dataframe
personal_data_info = identify_personal_data(
    data=my_dataframe,
    max_samples=200,  # Analyze up to 200 samples for text content
)

# Check which columns contain personal data
personal_columns = personal_data_info["columns"]

# Check what personal data was found in text content
content_findings = personal_data_info["content_findings"]
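The max_samples parameter caps how many values are inspected for content, which bounds the cost of scanning large datasets. The sketch below illustrates that idea only: it uses a simple email regex as a stand-in for the NLP-based analysis and is not secureml's implementation. The function `scan_columns_for_emails` and the toy data are hypothetical.

```python
import re
from itertools import islice
from typing import Dict, Iterable, List

# Deliberately simple email pattern; a stand-in for NLP-based detection.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scan_columns_for_emails(
    columns: Dict[str, Iterable[object]], max_samples: int = 100
) -> Dict[str, List[str]]:
    """Scan at most max_samples values per column for email addresses.

    Hypothetical sketch, not secureml's implementation. Returns a mapping
    of column name -> list of matches found in the sampled values.
    """
    findings: Dict[str, List[str]] = {}
    for name, values in columns.items():
        matches: List[str] = []
        # islice caps the work per column, mirroring the max_samples idea.
        for value in islice(values, max_samples):
            if isinstance(value, str):
                matches.extend(EMAIL_RE.findall(value))
        if matches:
            findings[name] = matches
    return findings


toy_columns = {
    "notes": ["Contact jane@example.com for details", "No contact given"],
    "age": [34, 51],
}
print(scan_columns_for_emails(toy_columns, max_samples=1))
```

A small cap trades recall for speed: personal data that appears only in rows beyond the cap goes undetected, so the cap is worth raising for sparse or heterogeneous text columns.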
- secureml.compliance.identify_phi(data: DataFrame, max_samples: int = 100, phi_identifiers: List[str] | None = None) -> Dict[str, Any]
Identify Protected Health Information (PHI) in a dataset.
- Args:
data: The dataset to analyze
max_samples: Maximum number of samples to analyze for content
phi_identifiers: List of PHI identifiers to check for
- Returns:
Dictionary with information about identified PHI
The following example runs identify_phi on a healthcare dataset and reads the columns flagged as containing PHI:
from secureml.compliance import identify_phi

# Identify PHI in a healthcare dataset
phi_info = identify_phi(
    data=healthcare_data,
    max_samples=100,
)

# Check which columns contain PHI
phi_columns = phi_info["columns"]
NLP Utilities
- secureml.compliance.get_nlp_model(model_name: str = 'en_core_web_sm') -> Language
Load and return a spaCy NLP model.
This function caches the model so that repeated calls do not reload it. If the model is not installed, the function attempts to download and install it.
- Args:
model_name: Name of the spaCy model to load
- Returns:
Loaded spaCy language model
- Raises:
ImportError: If the specified model cannot be installed or loaded
The following example loads the default spaCy model and extracts named entities from a sentence:
from secureml.compliance import get_nlp_model
# Get the default SpaCy model
nlp = get_nlp_model()
# Analyze text for entities
doc = nlp("Patient John Doe was diagnosed with hypertension.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
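The caching behavior described for get_nlp_model can be illustrated with a small, self-contained sketch. It uses functools.lru_cache and a fake loader so that it runs without spaCy installed; whether secureml uses this particular mechanism is an assumption, and `get_model_cached` is a hypothetical name.

```python
from functools import lru_cache

load_calls = []  # records each real load so the caching effect is visible


@lru_cache(maxsize=None)
def get_model_cached(model_name: str = "en_core_web_sm") -> dict:
    """Load a model once per name and reuse it on later calls.

    Hypothetical stand-in: a real loader would call spacy.load(model_name)
    here; a plain dict is returned so the sketch runs without spaCy.
    """
    load_calls.append(model_name)
    return {"name": model_name}


first = get_model_cached("en_core_web_sm")
second = get_model_cached("en_core_web_sm")
print(first is second)  # whether both calls returned the same cached object
print(len(load_calls))  # how many times the loader body actually ran
```

Caching per model name matters because loading an NLP model is typically far slower than analyzing a single document, and compliance checks may request the model many times in one run.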
Working with Regulation Presets
The compliance module uses regulation-specific presets that define rules and checks for each regulation. These presets are loaded from the secureml.presets module:
from secureml.presets import list_available_presets, load_preset, get_preset_field
# List available regulations
regulations = list_available_presets() # Returns ['gdpr', 'ccpa', 'hipaa', ...]
# Load GDPR preset
gdpr_preset = load_preset("gdpr")
# Get specific field from a preset
personal_data_identifiers = get_preset_field("gdpr", "personal_data_identifiers")
Supported Regulations
The module currently supports compliance checks for:
GDPR (General Data Protection Regulation)
- Checks for personal data and special categories
- Verifies data minimization
- Checks for consent metadata
- Verifies right-to-be-forgotten support
CCPA (California Consumer Privacy Act)
- Checks for personal information disclosure
- Verifies opt-out options for data sharing
- Checks deletion request support
HIPAA (Health Insurance Portability and Accountability Act)
- Identifies Protected Health Information (PHI)
- Checks for proper de-identification
- Verifies data encryption
Best Practices
- Regular audits: Run compliance checks regularly, especially before training models.
- Document remediation: Document how compliance issues were addressed.
- Multi-regulation: Check against all regulations applicable to your jurisdiction.
- Full pipeline: Audit the entire ML pipeline, not just individual components.
- Update checks: Keep regulation presets updated as laws and interpretations change.