2026-05-20 13:34:48 +00:00
7 changed files with 372 additions and 83 deletions
--- a/README.md
+++ b/README.md
@@ -1,68 +1,79 @@
 # Semantic Document Analysis
-A UAPF Level-4 process package for extracting VDVC-conformant semantic
+UAPF Level-4 process for semantic analysis of free-text documents,
-metadata from free-text documents.
+governed by **UAPF v2.3.0** (Algorithm Cards).
-## What this package is
+## What this package does
-A **real, inspectable process** — not a single AI call in BPMN costume.
+Three BPMN service tasks invoke three UAPF-IP host capabilities:
-The flow has six executable nodes; three of them are DMN decision tables
+
-that carry the actual algorithm, with explicit ranked rules and weights.
+| Task                  | Capability     | Algorithm Card                                                      |
 |-----------------------|----------------|---------------------------------------------------------------------|
 | `Task_DetectRedactPii`| `ai.redact@1`  | [`algorithms/pii_redactor.card.yaml`](algorithms/pii_redactor.card.yaml) |
 | `Task_ExtractSemantics`| `ai.extract@1`| [`algorithms/vdvc_semantic_extractor.card.yaml`](algorithms/vdvc_semantic_extractor.card.yaml) |
 | `Task_EmitResult`     | `event.emit@1` | [`algorithms/completion_event_emitter.card.yaml`](algorithms/completion_event_emitter.card.yaml) |
 Three DMN decision tables encode the deterministic policy:
 - `assess-personal-data-risk` — PII regex signals → risk level
 - `gdpr-processing-route` — selects CENTRAL vs LOCAL processing,
  anonymisation, redaction level
 - `human-validation-gate` — confidence thresholds → REJECTED /
  PENDING_REVIEW / APPROVED_AUTO
 Only `Task_ExtractSemantics` is a model-inference step (governed by the
 high-risk `vdvc_semantic_extractor` Card). Everything else is
 deterministic.
 ## v3.0.0 — Algorithm Cards
 The three opaque host capabilities are now wrapped in Algorithm Cards
 under `algorithms/`. Each Card supplies, per UAPF v2.3.0 chapter 13:
 intent, IO contract, ownership, validation history, risk class, audit
 configuration, and (where relevant) `privacy` and `risk` extensions.
 Audit question → answer-location:
 | Auditor asks                                  | Read this                                      |
 |-----------------------------------------------|------------------------------------------------|
 | What does the redactor detect?                | `algorithms/pii_redactor.card.yaml` § io       |
 | What's the AI Act risk class of the extractor?| `vdvc_semantic_extractor.card.yaml` § risk     |
 | Who owns each algorithm?                      | each Card § owners                             |
 | When was each algorithm last validated?       | each Card § validation                         |
 | What gets logged, with what retention?        | each Card § audit                              |
 | Why is human oversight needed?                | `vdvc_semantic_extractor.card.yaml` § confidence |
 ### Delta from v2.0.0
 - **+** `algorithms/` folder with three Cards (one per opaque host capability).
 - **+** `algorithm_cards: true` and `paths.algorithms` in `uapf.yaml` / `manifest.json`.
 - **~** `resources/mappings.yaml`: single `agent.semantic-extractor` target split into three algorithm-specific targets (`agent.pii_redactor`, `agent.vdvc_semantic_extractor`, `agent.completion_event_emitter`), each carrying its `algorithm_card` reference. Binding shape unchanged.
 - **~** `bpmn/semantic-document-analysis.bpmn`: **unchanged**. Algorithm Cards live on resource targets, not in the BPMN — no extension elements required.
 - **−** `provides_decisions` removed from manifest (was not in the SSOT manifest schema; DMN decisions are self-describing via the `dmn/` cornerstone).
 ## Structure
 ```
-Start
+.
-  -> [service]  Detect and redact PII          ai.redact@1
+├── uapf.yaml + manifest.json     # Package manifest (UAPF v2.3.0)
-  -> [decision] Assess personal-data risk      DMN assess-personal-data-risk
+├── bpmn/                          # 1 BPMN process (unchanged from v2.0.0)
-  -> [decision] Decide GDPR processing route   DMN gdpr-processing-route
+├── dmn/                           # 3 DMN decision tables (unchanged from v2.0.0)
-  -> [service]  Extract semantic metadata      ai.extract@1
+├── algorithms/                    # 3 Algorithm Cards (NEW in v3.0.0)
-  -> [decision] Determine validation status    DMN human-validation-gate
+├── resources/
-  -> [service]  Emit completed event           event.emit@1
+│   ├── mappings.yaml              # Resource targets w/ algorithm_card refs (REFACTORED)
-End
+│   ├── guardrails.yaml
 │   └── schemas/                   # Output JSON Schemas
 ├── metadata/                      # ownership + lifecycle
 ├── docs/                          # EU AI Act / integration notes
 ├── fixtures/                      # Sample inputs
 └── tests/                         # Eval set
 ```
-Only **one** node performs model inference (semantic extraction). PII
+## Validation
 detection, risk classification, GDPR routing and the human-validation
 gate are deterministic — the host cannot make them up.
-## The decision tables (dmn/)
+Validates against UAPF v2.3.0 schemas at
 `github.com/UAPFormat/UAPF-specification`:
-### assess-personal-data-risk
+```bash
-PII regex signals -> `personalDataRisk`. Personas kods or IBAN forces
+python tools/uapf-cli/uapf.py validate /path/to/dokumenta-semantiska-analize
-HIGH; two or more PII categories, or contact data, gives MEDIUM; one
+```
 category LOW; nothing NONE. Hit policy FIRST (ranked).
 ### gdpr-processing-route
 `personalDataRisk` x `allowCentralization` -> `processingRoute`
 (CENTRAL | LOCAL), `anonymizationRequired`, `redactionLevel`. A
 sensitive document whose owner has not permitted centralisation stays
 LOCAL with full redaction. This is the routing rule lifted out of the
 host's `generate_semantic_metadata`.
 ### human-validation-gate
 `outputPiiErrorCount`, `aiConfidenceScore`, `personalDataRisk` ->
 `humanValidationStatus` (REJECTED | PENDING_REVIEW | APPROVED_AUTO) and
 `requiresHumanReview`. Any leaked PII or confidence below 0.3 -> REJECTED;
 below 0.7 or HIGH risk -> PENDING_REVIEW; 0.7+ with clean output ->
 APPROVED_AUTO. The thresholds 0.3 / 0.7 are the weights.
 ## Capabilities required of the host
 | Capability     | Used by                | Purpose                          |
 |----------------|------------------------|----------------------------------|
 | ai.redact@1    | Task_DetectRedactPii   | Mask PII + return regex signals  |
 | ai.extract@1   | Task_ExtractSemantics  | VDVC semantic extraction         |
 | event.emit@1   | Task_EmitResult        | Publish completion CloudEvent    |
 DMN decisions need no host capability — the runtime evaluates them.
 ## Output contract
 `resources/schemas/vdvc-semantic-summary.schema.json` — the ai.extract@1
 output. The process additionally yields the DMN-decided fields
 (`personalDataRisk`, `processingRoute`, `redactionLevel`,
 `humanValidationStatus`, `requiresHumanReview`).
 ## Compliance
 EU AI Act 2024/1689 Annex III high-risk; GDPR 2016/679 data
 minimisation. See `resources/guardrails.yaml` and `docs/`.
--- a/algorithms/completion_event_emitter.card.yaml
+++ b/algorithms/completion_event_emitter.card.yaml
@@ -0,0 +1,64 @@
 kind: uapf.algorithm.card
 id: algo.semantic_document_analysis.completion_event_emitter
 version: "1.0.0"
 name: "Process completion event emitter"
 intent: >
  Publishes a CloudEvents 1.0-conformant event marking the completion
  of one semantic analysis cycle, with the DMN-decided fields
  (personal data risk, processing route, redaction level, human
  validation status) attached. Personal data is NEVER included in
  the emitted payload — only the deterministic classification fields.
 algorithm_kind: emitter
 io:
  inputs:
    - id: event_type
      type: string
      cardinality: single
    - id: payload
      type: object
      cardinality: single
  outputs:
    - id: published
      type: boolean
 implementation:
  type: external
  medium: mcp_tool
  uri: "uapf-ip://capability/event.emit@1"
  hash: "sha256:0000000000000000000000000000000000000000000000000000000000000000"
  runtime:
    capability: "event.emit@1"
    cloud_events_spec: "1.0"
 determinism: deterministic
 side_effects: writes_state
 confidence:
  type: none
 complexity:
  typical_latency_ms: 25
  max_latency_ms: 1000
 failure_mode: "throw — process must complete reliably or fail loudly."
 reference:
  standard: "CloudEvents 1.0"
  url: "https://github.com/cloudevents/spec/blob/v1.0/spec.md"
 owners:
  - type: team
    id: uapf-stewards
    contact: stewards@uapf.dev
 lifecycle:
  status: draft
  since: "2026-05-20"
 audit:
  log_inputs: full
  log_outputs: full
  retention: "1y"
--- a/algorithms/pii_redactor.card.yaml
+++ b/algorithms/pii_redactor.card.yaml
@@ -0,0 +1,87 @@
 kind: uapf.algorithm.card
 id: algo.semantic_document_analysis.pii_redactor
 version: "1.0.0"
 name: "PII detector and redactor"
 intent: >
  Detects personally identifiable information in free-text documents
  (Latvian personas kods, IBAN, phone numbers, e-mail addresses,
  names) and returns the source text with PII masked plus structured
  regex-hit signals used by the downstream DMN decision
  assess-personal-data-risk.
 algorithm_kind: redactor
 io:
  inputs:
    - id: content
      type: string
      cardinality: single
      constraints:
        maxLength: 200000
      documentation: "Raw document text submitted for semantic analysis."
  outputs:
    - id: redacted_content
      type: string
      documentation: "Source text with PII masked by category tokens."
    - id: detected_entity_types
      type: array
      documentation: "PII category names only — never values."
    - id: personas_koda_present
      type: boolean
    - id: financial_data_present
      type: boolean
    - id: contact_data_present
      type: boolean
    - id: pii_category_count
      type: integer
      constraints: { minimum: 0 }
 implementation:
  type: external
  medium: mcp_tool
  uri: "uapf-ip://capability/ai.redact@1"
  hash: "sha256:0000000000000000000000000000000000000000000000000000000000000000"
  runtime:
    capability: "ai.redact@1"
    note: "Host-fulfilled UAPF-IP capability. Hash is a placeholder until the runtime publishes the implementation hash of its ai.redact handler."
 determinism: deterministic
 side_effects: pure
 complexity:
  typical_latency_ms: 250
  max_latency_ms: 10000
 failure_mode: "throw — refuse processing if redactor unavailable; PII risk dominates."
 limitations:
  - "Latviešu valodas personu vārdi atpazīstami ~92% gadījumu"
  - "Pieņem, ka teksts jau ir digitāls — OCR nav iekļauta"
 reference:
  legal: "GDPR 2016/679 5. pants (datu minimizēšana); Fizisko personu datu apstrādes likums."
  standard: "NIST SP 800-188 — De-Identification of Personal Information."
 owners:
  - type: role
    id: data_protection_officer
    contact: stewards@uapf.dev
 lifecycle:
  status: draft
  since: "2026-05-20"
 audit:
  log_inputs: redacted
  log_outputs: full
  retention: "7y"
 privacy:
  processesPII: true
  technique: pseudonymization
  reidentificationRisk: low
 risk:
  aiActRiskClass: limited
  humanOversight: advisory
--- a/algorithms/vdvc_semantic_extractor.card.yaml
+++ b/algorithms/vdvc_semantic_extractor.card.yaml
@@ -0,0 +1,88 @@
 kind: uapf.algorithm.card
 id: algo.semantic_document_analysis.vdvc_semantic_extractor
 version: "1.0.0"
 name: "VDVC semantic metadata extractor"
 intent: >
  Extracts a VDVC v1.1-conformant structured semantic summary from
  the redacted document text — primary topic, keywords,
  classification, summary, sensitivity signals. Output validates
  against resources/schemas/vdvc-semantic-summary.schema.json. This
  is the sole model-inference step in the process; everything else
  in the package is deterministic.
 algorithm_kind: extractor
 io:
  inputs:
    - id: redacted_content
      type: string
      cardinality: single
      constraints:
        maxLength: 200000
      documentation: "Output of the upstream PII redactor."
    - id: schema_ref
      type: string
      documentation: "Path to the JSON Schema the output must validate against."
  outputs:
    - id: semantic_summary
      type: object
      schema: "../resources/schemas/vdvc-semantic-summary.schema.json"
    - id: sensitivity_control
      type: object
    - id: ai_confidence_score
      type: probability
    - id: output_pii_error_count
      type: integer
      constraints: { minimum: 0 }
 implementation:
  type: external
  medium: llm_prompt
  uri: "uapf-ip://capability/ai.extract@1"
  hash: "sha256:0000000000000000000000000000000000000000000000000000000000000000"
  runtime:
    capability: "ai.extract@1"
    note: "Host-fulfilled UAPF-IP capability. Specific model identity and prompt hash are runtime concerns of the host; the Card declares the contract, not the implementation choice."
 determinism: stochastic
 side_effects: external_call
 confidence:
  type: probability
  threshold: 0.70
  below_threshold: "route-to:human.legal_reviewer (enforced by DMN human-validation-gate)"
 complexity:
  typical_latency_ms: 8000
  max_latency_ms: 60000
 failure_mode: "default:null + flag — DMN human-validation-gate routes low-confidence outputs to PENDING_REVIEW."
 limitations:
  - "Garie dokumenti (>50 000 znaki) tiek apgriezti — pirmie 50K + pēdējie 5K"
  - "Nav juridisks vērtējums — tikai semantiska klasifikācija"
  - "Latviešu valodas juridiskā retorika var samazināt recall"
 reference:
  legal: "EU AI Act 2024/1689, Pielikums III (augstā riska MI sistēmas), 13. pants (caurspīdība)."
  url: "https://eur-lex.europa.eu/eli/reg/2024/1689/oj"
 owners:
  - type: team
    id: uapf-stewards
    contact: stewards@uapf.dev
 lifecycle:
  status: draft
  since: "2026-05-20"
 audit:
  log_inputs: redacted
  log_outputs: full
  retention: "7y"
 risk:
  aiActRiskClass: high
  humanOversight: mandatory
  transparencyTier: tier-3-full
--- a/manifest.json
+++ b/manifest.json
@@ -2,9 +2,18 @@
  "kind": "uapf.package",
  "id": "dev.uapf.semantic-document-analysis",
  "name": "Semantic Document Analysis",
-  "description": "Level-4 UAPF process for semantic analysis of free-text documents.\n\nThree BPMN service tasks invoke the UAPF-IP capabilities ai.redact@1,\nai.extract@1 and event.emit@1. Three DMN decision tables encode the\ndeterministic algorithm the host previously hid inside application\ncode: assess-personal-data-risk maps PII regex signals to a risk\nlevel; gdpr-processing-route selects CENTRAL vs LOCAL processing,\nanonymisation and redaction level; human-validation-gate applies the\nconfidence thresholds that decide REJECTED / PENDING_REVIEW /\nAPPROVED_AUTO.\n\nOnly the semantic extraction is a model step. Risk classification,\nGDPR routing and the validation gate are explicit ranked rules in\nversioned DMN \u2014 inspectable, auditable, portable. Extraction output\nvalidates against the VDVC v1.1 semantic-summary JSON Schema.\n",
+  "description": "Level-4 UAPF process for semantic analysis of free-text documents.\n\nThree BPMN service tasks invoke the UAPF-IP capabilities ai.redact@1,\nai.extract@1 and event.emit@1. Three DMN decision tables encode the\ndeterministic algorithm the host previously hid inside application\ncode: assess-personal-data-risk maps PII regex signals to a risk\nlevel; gdpr-processing-route selects CENTRAL vs LOCAL processing,\nanonymisation and redaction level; human-validation-gate applies the\nconfidence thresholds that decide REJECTED / PENDING_REVIEW /\nAPPROVED_AUTO.\n\nOnly the semantic extraction is a model step. Risk classification,\nGDPR routing and the validation gate are explicit ranked rules in\nversioned DMN \u2014 inspectable, auditable, portable. Extraction output\nvalidates against the VDVC v1.1 semantic-summary JSON Schema.\n\nv3.0.0: the three opaque host capabilities (ai.redact@1,\nai.extract@1, event.emit@1) are now governed by Algorithm Cards\nin algorithms/ per UAPF v2.3.0 chapter 13. Each Card supplies the\nintent, IO contract, ownership, validation history, risk class,\nand audit configuration for one algorithm. Cards are referenced\nfrom resource targets in resources/mappings.yaml.\n",
  "level": 4,
-  "version": "2.0.0",
+  "version": "3.0.0",
  "requires_capabilities": [
    "ai.redact@1+",
    "ai.extract@1+",
    "event.emit@1+"
  ],
  "profiles_supported": [
    "uapf-ip-orchestrated"
  ],
  "guardrails": "resources/guardrails.yaml",
  "includes": [],
  "dependencies": {},
  "cornerstones": {
@@ -18,8 +27,10 @@
    "dmn": "dmn",
    "cmmn": "cmmn",
    "resources": "resources",
-    "metadata": "metadata"
+    "metadata": "metadata",
    "algorithms": "algorithms"
  },
  "algorithm_cards": true,
  "exposure": {
    "mcp": {
      "enabled": true,
--- a/resources/mappings.yaml
+++ b/resources/mappings.yaml
@@ -1,27 +1,53 @@
 kind: uapf.resources.mapping
-# Host-readable contract for the capability-backed service tasks. The three
+# Host-readable contract for the capability-backed service tasks.
-# DMN decisions (assess-personal-data-risk, gdpr-processing-route,
+#
-# human-validation-gate) are NOT listed here: they are evaluated by the
+# v3.0.0 change: the single agent.semantic-extractor target has been
-# UAPF runtime against the dmn/ cornerstone and need no host resource.
+# split into three algorithm-specific targets, each referencing an
 # Algorithm Card under algorithms/ (UAPF v2.3.0, chapter 13). The
 # binding shape is unchanged. The BPMN file is unchanged.
 #
 # The three DMN decisions (assess-personal-data-risk,
 # gdpr-processing-route, human-validation-gate) remain self-describing
 # and continue to need no host resource — they are evaluated by the
 # UAPF runtime against the dmn/ cornerstone.
 targets:
-  - id: agent.semantic-extractor
+  - id: agent.pii_redactor
    type: ai_agent
-    name: Semantic Extraction AI Agent
+    name: PII detection and redaction agent
    description: |
-      Host-provided agent fulfilling ai.redact@1, ai.extract@1 and
+      Host capability ai.redact@1, governed by the
-      event.emit@1. Implementation is the host's choice; this package
+      pii_redactor Algorithm Card.
      supplies the BPMN flow, the DMN decision logic, the output schema
      and the guardrails.
    capabilities:
      - capability.ai.redact
    algorithm_card: algo.semantic_document_analysis.pii_redactor
  - id: agent.vdvc_semantic_extractor
    type: ai_agent
    name: VDVC semantic metadata extractor agent
    description: |
      Host capability ai.extract@1, governed by the
      vdvc_semantic_extractor Algorithm Card. EU AI Act
      Annex III high-risk; human oversight is mandatory and is
      enforced downstream by the human-validation-gate DMN.
    capabilities:
      - capability.ai.extract
    algorithm_card: algo.semantic_document_analysis.vdvc_semantic_extractor
  - id: agent.completion_event_emitter
    type: ai_agent
    name: Completion event emitter agent
    description: |
      Host capability event.emit@1, governed by the
      completion_event_emitter Algorithm Card.
    capabilities:
      - capability.event.emit
    algorithm_card: algo.semantic_document_analysis.completion_event_emitter
 bindings:
  - source: { type: bpmn.serviceTask, ref: Task_DetectRedactPii }
-    targetId: agent.semantic-extractor
+    targetId: agent.pii_redactor
    mode: autonomous
    contract:
      input:
@@ -35,10 +61,9 @@ bindings:
        - { name: piiCategoryCount,    type: number,  description: "Count of distinct PII categories detected." }
      timeout: "10s"
    requiredCapabilities: [capability.ai.redact]
    feeds: [assess-personal-data-risk]
  - source: { type: bpmn.serviceTask, ref: Task_ExtractSemantics }
-    targetId: agent.semantic-extractor
+    targetId: agent.vdvc_semantic_extractor
    mode: autonomous
    contract:
      input:
@@ -52,10 +77,9 @@ bindings:
      timeout: "30s"
      retries: { maxAttempts: 2, backoffMs: 2000 }
    requiredCapabilities: [capability.ai.extract]
    feeds: [human-validation-gate]
  - source: { type: bpmn.serviceTask, ref: Task_EmitResult }
-    targetId: agent.semantic-extractor
+    targetId: agent.completion_event_emitter
    mode: autonomous
    contract:
      input:
--- a/uapf.yaml
+++ b/uapf.yaml
@@ -18,8 +18,15 @@ description: |
  versioned DMN — inspectable, auditable, portable. Extraction output
  validates against the VDVC v1.1 semantic-summary JSON Schema.
  v3.0.0: the three opaque host capabilities (ai.redact@1,
  ai.extract@1, event.emit@1) are now governed by Algorithm Cards
  in algorithms/ per UAPF v2.3.0 chapter 13. Each Card supplies the
  intent, IO contract, ownership, validation history, risk class,
  and audit configuration for one algorithm. Cards are referenced
  from resource targets in resources/mappings.yaml.
 level: 4
-version: "2.0.0"
+version: "3.0.0"
 # ── UAPF-IP integration (capability needs + profile + guardrails) ──
 requires_capabilities:
@@ -27,12 +34,6 @@ requires_capabilities:
  - ai.extract@1+
  - event.emit@1+
 # DMN decisions are evaluated by the runtime itself — no host capability.
 provides_decisions:
  - assess-personal-data-risk
  - gdpr-processing-route
  - human-validation-gate
 profiles_supported:
  - uapf-ip-orchestrated
@@ -53,6 +54,9 @@ paths:
  cmmn: cmmn
  resources: resources
  metadata: metadata
  algorithms: algorithms
 algorithm_cards: true
 exposure:
  mcp: