Compliance API

This module provides tools to verify that datasets and models comply with privacy regulations like GDPR, CCPA, and HIPAA.

Main Functions

secureml.compliance.check_compliance(data: DataFrame | Dict[str, Any], model_config: Dict[str, Any] | None = None, regulation: str = 'GDPR', max_samples: int = 100, **kwargs: Any) → ComplianceReport

Check a dataset and model configuration for compliance with privacy regulations.

Uses NLP techniques to analyze dataset content for sensitive information.

Args:

data: The dataset to check (DataFrame or dict with metadata)
model_config: Optional configuration of the model to check
regulation: The regulation to check compliance against ('GDPR', 'CCPA', 'HIPAA', 'LGPD')
max_samples: Maximum number of samples to analyze for content
**kwargs: Additional parameters for specific compliance checks

Returns:

A ComplianceReport with the results of the compliance check

Raises:

ValueError: If an unsupported regulation is specified

This is the main function for checking compliance with privacy regulations:

from secureml.compliance import check_compliance

# Check a dataset for GDPR compliance
report = check_compliance(
    data=my_dataframe,
    regulation="GDPR"
)

# Check if any issues were found
if report.has_issues():
    print(report)
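
Because check_compliance raises ValueError for an unsupported regulation, callers that take the regulation name from configuration or user input may want to guard the call. The following is a minimal sketch and not part of secureml: try_check is a hypothetical helper, and fake_check is a stand-in so the snippet runs on its own. In practice you would pass check_compliance as check_fn.

```python
from typing import Any, Callable, Optional, Tuple


def try_check(
    check_fn: Callable[..., Any],
    data: Any,
    regulation: str,
) -> Tuple[Optional[Any], Optional[str]]:
    """Call check_fn and turn an unsupported-regulation error into a message.

    Hypothetical helper: check_fn is expected to behave like
    check_compliance, i.e. raise ValueError for an unsupported regulation.
    Returns (report, None) on success or (None, error_message) on failure.
    """
    try:
        return check_fn(data=data, regulation=regulation), None
    except ValueError as exc:
        return None, f"Cannot check {regulation!r}: {exc}"


# Stand-in for check_compliance so the example runs without secureml installed.
def fake_check(data: Any, regulation: str = "GDPR") -> str:
    if regulation not in {"GDPR", "CCPA", "HIPAA", "LGPD"}:
        raise ValueError(f"Unsupported regulation: {regulation}")
    return f"report for {regulation}"


report, error = try_check(fake_check, {"rows": []}, "GDPR")
missing_report, missing_error = try_check(fake_check, {"rows": []}, "XYZ")
```

Returning the error as a value keeps a batch job running when one regulation name is misconfigured; re-raising is the better choice when a bad name should stop the run.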

Compliance Reports

class secureml.compliance.ComplianceReport(regulation: str)

A report containing the results of a compliance check.

__init__(regulation: str)

Initialize a compliance report.

Args:

regulation: The regulation the check was performed against

add_issue(component: str, issue: str, severity: str, recommendation: str) → None

Add an issue to the report.

Args:

component: The component where the issue was found
issue: Description of the issue
severity: Severity level ('high', 'medium', 'low')
recommendation: Recommended action to resolve the issue

add_passed_check(check_name: str) → None

Add a passed check to the report.

Args:

check_name: Name of the check that passed

add_warning(component: str, warning: str, recommendation: str) → None

Add a warning to the report.

Args:

component: The component where the warning was triggered
warning: Description of the warning
recommendation: Recommended action to address the warning

generate_report(output_file: str, format: str = 'html', logo_path: str | None = None, include_charts: bool = True) → str

Generate a report from the compliance check results.

Args:

output_file: Path to write the report to
format: Report format ('html' or 'pdf')
logo_path: Path to a logo image
include_charts: Whether to include charts in the report

Returns:

Path to the generated report

has_issues() → bool

Check if the report contains any issues.

Returns:

True if the report contains issues, False otherwise

has_warnings() → bool

Check if the report contains any warnings.

Returns:

True if the report contains warnings, False otherwise

summary() → Dict[str, Any]

Get a summary of the compliance report.

Returns:

A dictionary containing the summary information

The ComplianceReport class contains the results of a compliance check and provides methods for accessing and displaying those results:

# Access report summary
summary = report.summary()

# Get detailed information
if report.has_issues():
    for issue in report.issues:
        print(f"Issue: {issue['issue']}")
        print(f"Severity: {issue['severity']}")
        print(f"Recommendation: {issue['recommendation']}")
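
When a report contains many issues, it can help to review the most severe ones first. The sketch below assumes, as the example above implies, that each entry in report.issues is a dictionary with 'issue', 'severity', and 'recommendation' keys, and that severity uses the levels documented for add_issue. sort_by_severity is a hypothetical helper, and the sample entries are made up for illustration.

```python
from typing import Any, Dict, List

# Severity levels as documented for add_issue, most urgent first.
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def sort_by_severity(issues: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return issues ordered high -> medium -> low; unknown levels go last."""
    return sorted(
        issues,
        key=lambda issue: SEVERITY_ORDER.get(issue.get("severity"), len(SEVERITY_ORDER)),
    )


# Made-up sample data shaped like the entries printed in the example above.
sample_issues = [
    {"issue": "No retention policy", "severity": "low", "recommendation": "Define one"},
    {"issue": "Unencrypted email column", "severity": "high", "recommendation": "Encrypt it"},
    {"issue": "Missing consent metadata", "severity": "medium", "recommendation": "Record consent"},
]

ordered = sort_by_severity(sample_issues)
severities = [issue["severity"] for issue in ordered]
```

With a real report, the call would be sort_by_severity(report.issues).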

Compliance Auditor

class secureml.compliance.ComplianceAuditor(regulation: str = 'GDPR', log_dir: str | None = None)

A class for auditing ML pipelines for compliance with privacy regulations.

The ComplianceAuditor provides a higher-level interface for conducting compliance audits of ML pipelines, generating comprehensive audit trails, and producing detailed reports.

__init__(regulation: str = 'GDPR', log_dir: str | None = None)

Initialize a compliance auditor.

Args:

regulation: The regulation to audit against
log_dir: Directory to store audit logs

audit_dataset(dataset: DataFrame | Dict[str, Any], dataset_name: str, metadata: Dict[str, Any] | None = None, max_samples: int = 100) → ComplianceReport

Audit a dataset for compliance.

Args:

dataset: The dataset to audit
dataset_name: Name of the dataset
metadata: Additional metadata about the dataset
max_samples: Maximum number of samples to analyze for content

Returns:

A compliance report for the dataset

audit_model(model_config: Dict[str, Any], model_name: str, model_documentation: Dict[str, Any] | None = None) → ComplianceReport

Audit a model for compliance.

Args:

model_config: Configuration of the model
model_name: Name of the model
model_documentation: Additional documentation about the model

Returns:

A compliance report for the model

audit_pipeline(dataset, dataset_name, model, model_name, preprocessing_steps, metadata)

Audit an entire ML pipeline for compliance.

Args:

dataset: The dataset used in the pipeline
dataset_name: Name of the dataset
model: Model configuration or object
model_name: Name of the model
preprocessing_steps: List of preprocessing steps
metadata: Additional metadata about the pipeline

Returns:

Dictionary containing compliance reports for each component

generate_pdf(audit_result: Dict[str, Any], output_file: str, title: str | None = None, logo_path: str | None = None) → str

Generate a PDF report from an audit result.

Args:

audit_result: The result of an audit_pipeline call
output_file: Path to write the PDF report to
title: Title for the report
logo_path: Path to a logo image

Returns:

Path to the generated PDF

The following example audits a dataset, a model, and a full pipeline with ComplianceAuditor, then generates a PDF report from the pipeline audit:

from secureml.compliance import ComplianceAuditor

# Create an auditor for GDPR compliance
auditor = ComplianceAuditor(regulation="GDPR")

# Audit a dataset
dataset_report = auditor.audit_dataset(
    dataset=my_dataframe,
    dataset_name="customer_data"
)

# Audit a model
model_report = auditor.audit_model(
    model_config=model_params,
    model_name="credit_scoring_model"
)

# Audit an entire ML pipeline
pipeline_report = auditor.audit_pipeline(
    dataset=my_dataframe,
    dataset_name="customer_data",
    model=my_model,
    model_name="credit_scoring_model",
    preprocessing_steps=preprocessing_config
)

# Generate a PDF report
auditor.generate_pdf(
    pipeline_report,
    output_file="compliance_report.pdf",
    title="ML Pipeline Compliance Audit"
)
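
audit_pipeline returns a dictionary of compliance reports per component, so a common follow-up is to find which components need attention. The sketch below is a hypothetical helper, not part of secureml. It assumes only that report values expose has_issues(), as ComplianceReport does. The keys in example_result are invented for illustration, since the actual keys are not documented here, and StubReport exists only to make the snippet runnable.

```python
from typing import Any, Dict, List


def components_with_issues(audit_result: Dict[str, Any]) -> List[str]:
    """Return the keys whose value is a report-like object reporting issues.

    Assumption: report values expose has_issues(), as ComplianceReport does.
    Values without that method (for example plain metadata) are skipped.
    """
    flagged = []
    for name, value in audit_result.items():
        has_issues = getattr(value, "has_issues", None)
        if callable(has_issues) and has_issues():
            flagged.append(name)
    return flagged


class StubReport:
    """Stand-in for ComplianceReport, used only to make the example runnable."""

    def __init__(self, issue_count: int) -> None:
        self.issue_count = issue_count

    def has_issues(self) -> bool:
        return self.issue_count > 0


# Hypothetical result shape; the real keys returned by audit_pipeline may differ.
example_result = {
    "dataset": StubReport(issue_count=2),
    "model": StubReport(issue_count=0),
    "audited_at": "2024-01-01",
}

flagged = components_with_issues(example_result)
```

Duck typing on has_issues() keeps the helper tolerant of non-report entries in the dictionary.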

Data Identification Functions

secureml.compliance.identify_personal_data(data: DataFrame, max_samples: int = 100, personal_identifiers: List[str] | None = None, sensitive_categories: List[str] | None = None) → Dict[str, Any]

Identify personal data in a dataset.

Args:

data: The dataset to analyze
max_samples: Maximum number of samples to analyze for content
personal_identifiers: List of personal data identifiers to check for
sensitive_categories: List of sensitive data categories to check for

Returns:

Dictionary with information about identified personal data

The returned dictionary reports both column-level matches and findings from the sampled text content:

from secureml.compliance import identify_personal_data

# Identify personal data in a dataframe
personal_data_info = identify_personal_data(
    data=my_dataframe,
    max_samples=200  # Analyze up to 200 samples for text content
)

# Check which columns contain personal data
personal_columns = personal_data_info["columns"]

# Check what personal data was found in text content
content_findings = personal_data_info["content_findings"]
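
To make the role of personal_identifiers more concrete, here is a simplified illustration of column-name matching. It is not the library's algorithm, whose details are not described here. match_columns is a hypothetical helper, and plain substring matching is an assumption that can produce false positives (for example, "filename" contains "name").

```python
from typing import Dict, Iterable, List


def match_columns(
    column_names: Iterable[str],
    identifiers: Iterable[str],
) -> Dict[str, List[str]]:
    """Map each column to the identifiers found in its normalized name."""
    normalized_ids = [ident.lower().replace(" ", "_") for ident in identifiers]
    matches: Dict[str, List[str]] = {}
    for column in column_names:
        # Normalize case and separators so "Phone-Number" matches "phone".
        normalized = column.lower().replace(" ", "_").replace("-", "_")
        hits = [ident for ident in normalized_ids if ident in normalized]
        if hits:
            matches[column] = hits
    return matches


columns = ["Customer Name", "email_address", "purchase_total", "Phone-Number"]
identifiers = ["name", "email", "phone"]
matched = match_columns(columns, identifiers)
```

With a DataFrame, the column names would come from my_dataframe.columns.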

secureml.compliance.identify_phi(data: DataFrame, max_samples: int = 100, phi_identifiers: List[str] | None = None) → Dict[str, Any]

Identify Protected Health Information (PHI) in a dataset.

Args:

data: The dataset to analyze
max_samples: Maximum number of samples to analyze for content
phi_identifiers: List of PHI identifiers to check for

Returns:

Dictionary with information about identified PHI

This function identifies Protected Health Information (PHI) in a dataset:

from secureml.compliance import identify_phi

# Identify PHI in a healthcare dataset
phi_info = identify_phi(
    data=healthcare_data,
    max_samples=100
)

# Check which columns contain PHI
phi_columns = phi_info["columns"]
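
The max_samples parameter bounds how much content is inspected. The sketch below illustrates that trade-off with a simple regular expression for US Social Security numbers, one of the identifier types HIPAA covers. It is an illustration only: scan_samples is a hypothetical helper, the sample notes are made up, and taking the first max_samples values (rather than a random sample) is an assumption about how sampling might work.

```python
import re
from typing import Iterable, List

# US Social Security number written as 3-2-4 digits separated by hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def scan_samples(values: Iterable[object], max_samples: int = 100) -> List[int]:
    """Return positions, among the first max_samples values, with an SSN-like string."""
    hits = []
    for index, value in enumerate(values):
        if index >= max_samples:
            break  # stop once the sample budget is used up
        if SSN_PATTERN.search(str(value)):
            hits.append(index)
    return hits


# Made-up free-text notes for illustration.
notes = [
    "Follow-up in two weeks",
    "SSN 123-45-6789 on file",
    "No changes",
    "Patient ID 987-65-4321",
]

first_two = scan_samples(notes, max_samples=2)
all_rows = scan_samples(notes)
```

A small max_samples keeps the scan fast but can miss identifiers that appear only in later rows, as the difference between first_two and all_rows shows.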

NLP Utilities

secureml.compliance.get_nlp_model(model_name: str = 'en_core_web_sm') → Language

Load and return a SpaCy NLP model.

This function caches the model to avoid reloading it multiple times. If the model is not installed, it will attempt to download and install it.

Args:

model_name: Name of the SpaCy model to load

Returns:

Loaded SpaCy language model

Raises:

ImportError: If the specified model cannot be installed or loaded

This function loads and caches a SpaCy NLP model for text analysis:

from secureml.compliance import get_nlp_model

# Get the default SpaCy model
nlp = get_nlp_model()

# Analyze text for entities
doc = nlp("Patient John Doe was diagnosed with hypertension.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
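
The caching behavior described for get_nlp_model can be pictured with the standard library's functools.lru_cache. This is a sketch of the general technique, not the library's implementation, which is not shown here. get_model and load_calls are hypothetical names, and the dictionary stands in for an expensive load such as spacy.load(model_name).

```python
from functools import lru_cache

load_calls = []  # records every real (non-cached) load


@lru_cache(maxsize=None)
def get_model(model_name: str = "en_core_web_sm") -> dict:
    """Load a model once per name; later calls return the cached object."""
    load_calls.append(model_name)
    # Stand-in for an expensive load such as spacy.load(model_name).
    return {"name": model_name}


first = get_model("en_core_web_sm")
second = get_model("en_core_web_sm")  # served from the cache
other = get_model("en_core_web_md")   # different name, so a new load
```

Note that lru_cache keys on the arguments as written, so get_model() and get_model("en_core_web_sm") would be cached as separate entries.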

Working with Regulation Presets

The compliance module uses regulation-specific presets that define rules and checks for each regulation. These presets are loaded from the secureml.presets module:

from secureml.presets import list_available_presets, load_preset, get_preset_field

# List available regulations
regulations = list_available_presets()  # Returns ['gdpr', 'ccpa', 'hipaa', ...]

# Load GDPR preset
gdpr_preset = load_preset("gdpr")

# Get specific field from a preset
personal_data_identifiers = get_preset_field("gdpr", "personal_data_identifiers")

Supported Regulations

The module currently supports compliance checks for:

GDPR (General Data Protection Regulation)
- Checks for personal data and special categories
- Verifies data minimization
- Checks for consent metadata
- Verifies right-to-be-forgotten support

CCPA (California Consumer Privacy Act)
- Checks for personal information disclosure
- Verifies opt-out options for data sharing
- Checks deletion request support

HIPAA (Health Insurance Portability and Accountability Act)
- Identifies Protected Health Information (PHI)
- Checks for proper de-identification
- Verifies data encryption

Best Practices

Regular audits: Run compliance checks regularly, especially before training models
Document remediation: Document how compliance issues were addressed
Multi-regulation: Check against all regulations applicable to your jurisdiction
Full pipeline: Audit the entire ML pipeline, not just individual components
Update checks: Keep regulation presets updated as laws and interpretations change
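
The multi-regulation practice can be sketched as a loop that runs one check per regulation and collects the reports. check_all and regulations_failing are hypothetical helpers, not part of secureml. The check function is passed in so the snippet runs on its own, and StubReport and fake_check are stand-ins with made-up behavior. In practice you would pass check_compliance.

```python
from typing import Any, Callable, Dict, Iterable, List


def check_all(
    check_fn: Callable[..., Any],
    data: Any,
    regulations: Iterable[str],
) -> Dict[str, Any]:
    """Run check_fn once per regulation and return reports keyed by regulation."""
    return {reg: check_fn(data=data, regulation=reg) for reg in regulations}


def regulations_failing(reports: Dict[str, Any]) -> List[str]:
    """List the regulations whose report has issues, using has_issues()."""
    return [reg for reg, report in reports.items() if report.has_issues()]


class StubReport:
    """Stand-in for ComplianceReport, used only to make the example runnable."""

    def __init__(self, issues: List[str]) -> None:
        self.issues = issues

    def has_issues(self) -> bool:
        return bool(self.issues)


def fake_check(data: Dict[str, Any], regulation: str = "GDPR") -> StubReport:
    # Made-up rule: only the HIPAA check flags a "diagnosis" column.
    flagged = regulation == "HIPAA" and "diagnosis" in data
    return StubReport(["diagnosis column may contain PHI"] if flagged else [])


reports = check_all(fake_check, {"diagnosis": ["flu"]}, ["GDPR", "CCPA", "HIPAA"])
failing = regulations_failing(reports)
```

Keeping the reports keyed by regulation makes it straightforward to document remediation per regulation afterwards.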