VARAM_classification_schema/VDVC_Classification_Assessment_v2.md

# VDVC Document Classification Schema — Assessment & Transformation Proposal

**Subject:** VARAM DVS "Namejs" Document Classification Schema 2026
**VDVC namespace:** `urn:vdvc:classification:2026`
**Regulatory basis:** MK noteikumi Nr. 282 (07.05.2024) "Dokumentu un arhīvu pārvaldības noteikumi"
**Prepared by:** Rihards / PwC Latvia — Digitalization, AI & Cybersecurity
**Date:** February 2026

---

## 1. Executive Summary

VARAM's Document Management System (DVS "Namejs") relies on a classification schema ("klasifikācijas shēma", formerly "lietu nomenklatūra") maintained as a human-edited Excel spreadsheet with **647 coded entries** across 3 domains and up to 5 hierarchy levels.

This assessment identifies **three layers of problems**: data quality issues in the spreadsheet itself (fixable mechanically), structural design issues in the schema (fixable with refactoring), and a **fundamental architectural problem**: the classification philosophy conflates the normative origin of a document with its functional classification. The result is an unmanageably large, duplicate-heavy taxonomy that is hostile to both human clerks and the DVS itself.

The proposed solution is a VDVC-namespaced, Git-versioned XML repository on ProcessGit with a **custom web-based management GUI** (not Excel), backed by XSD-validated XML and served through an MCP endpoint for AI-assisted document classification.

---

## 2. Regulatory Framework

### 2.1 MK noteikumi Nr. 282 (07.05.2024)

The governing regulation prescribes a **function-based hierarchical classification** (§33):

| Level | MK Nr. 282 Definition | What It Should Contain |
|-------|----------------------|----------------------|
| **L1** | Institūcijas **funkcija** vai augstākā struktūrvienība | Broad organizational function (e.g., "Management", "HR", "Procurement") |
| **L2** | Funkcijas izpildes nepieciešamie **uzdevumi (procesi)** | Processes within the function (e.g., "Recruitment", "Payroll") |
| **L3** | Uzdevumu veikšanai nepieciešamās **darbības** | Specific activities/document types (e.g., "Employment contracts", "Timesheets") |

Key regulatory requirements:
- Schema must be synchronized with Latvijas Nacionālais arhīvs (LNA) every **5 years** (§42)
- Sector-level schemas ("nozares klasifikācijas shēma") every **8 years** (§42)
- Must specify: index, name, retention term, responsible unit, media type, IS location (§31)
- Classification basis: functions, structural units, document types, or **mixed** (§33)

### 2.2 VDVC Context

The schema should use the **VDVC** (Valsts Dokumentu Vadības Centrs) namespace because VDVC is the country-wide document management authority managed by VARAM, and the classification approach could then be standardized across government institutions rather than used only within VARAM.

---

## 3. Critical Assessment: Why Is the Classification So Complicated?

### 3.1 The Core Problem — Normative Document Proliferation

VARAM's explanation is that every new document category originates from a normative act (law, MK regulation, EU directive) that delegates a process to the organization. When a new regulation is adopted, a new classification entry is created. **This approach is fundamentally flawed** for the following reasons:

#### Problem A: Confusing "What Triggered the Document" with "What Kind of Document It Is"

MK Nr. 282 §33 defines classification by **function and process**, not by **legal basis**. The legal basis of a document is metadata (a property of the document), not a structural category. VARAM nevertheless maintains separate categories for "Sarakste ar valsts pārvaldes iestādēm DIENESTA VAJADZĪBĀM" (P-1-13-5, correspondence with public administration institutions marked for official use), "Sarakste ar valsts pārvaldes iestādēm, juridiskām un fiziskām personām" (P-1-13-2, correspondence with public administration institutions, legal and natural persons), and "Sarakste ar valsts pārvaldes iestādēm jautājumiem, kas saistīti ar valsts noslēpumu" (P-1-13-9, correspondence with public administration institutions on matters involving state secrets). **These are all the same function** (correspondence); they differ only in metadata attributes (audience, classification level).

A proper design would have:
```
P-1-13  Sarakste (Correspondence)
  → metadata: audience = [government | private | foreign | internal | classified]
  → metadata: securityLevel = [public | restricted | secret]
```

This replaces the 9 sub-categories of correspondence, each of which holds the same document types.

#### Problem B: EU Investment Project Explosion

The most egregious example is **I2 (Investīciju projektu ieviešana)** with **33 top-level entries** — each representing a specific EU-funded project:

```
I2-1   Projekta "Informācijas sistēmu ... Nr. 2.2.1.1/17/I/012" dokumenti
I2-2   Projekta "Atvērto datu ... Nr. 2.2.1.1/19/I/004" dokumenti
...
I2-33  Projekta "Valsts pārvaldes vienota valsts finanšu..." dokumenti
```

Each of these 33 projects then has an **identical sub-structure**: correspondence, contracts, orders, communications materials. This is a textbook example of **data masquerading as structure**. The project identity is a data attribute, not a classification level.

A proper design:
```
I2     Investīciju projektu ieviešana (Project Implementation)
  I2-1  Korespondence (Correspondence)
  I2-2  Līgumi (Contracts)
  I2-3  Rīkojumi, protokoli (Orders, protocols)
  I2-4  Komunikācijas materiāli (Communications)
  → metadata: projectId = "2.2.1.1/17/I/012"
  → metadata: projectName = "Informācijas sistēmu..."
  → metadata: fundingSource = "ERAF" | "ANM" | ...
```

This would reduce I2 from **166 entries to ~10-15**, while preserving all information through metadata.
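
The collapse described above can be sketched in a few lines of Python. This is a minimal illustration, not the migration tool: the old codes in `ROWS` and the rule that new codes are numbered under `I2` in first-seen order are assumptions made for the example; only the two project numbers come from the current schema.

```python
from collections import defaultdict

# Hypothetical sample rows: (old code, project number, functional sub-category).
# Real rows would come from the Excel export; the old codes are illustrative.
ROWS = [
    ("I2-1-1", "2.2.1.1/17/I/012", "Korespondence"),
    ("I2-1-2", "2.2.1.1/17/I/012", "Līgumi"),
    ("I2-2-1", "2.2.1.1/19/I/004", "Korespondence"),
    ("I2-2-2", "2.2.1.1/19/I/004", "Līgumi"),
]

def collapse_projects(rows):
    """Group project-specific rows by function; keep projects as metadata."""
    by_function = defaultdict(lambda: {"projectRefs": [], "oldCodes": []})
    for old_code, project_id, function_name in rows:
        entry = by_function[function_name]
        if project_id not in entry["projectRefs"]:
            entry["projectRefs"].append(project_id)
        entry["oldCodes"].append(old_code)
    # Assumption: new codes are assigned in first-seen order (I2-1, I2-2, ...).
    return {
        f"I2-{i}": {"name": name, **meta}
        for i, (name, meta) in enumerate(by_function.items(), start=1)
    }

collapsed = collapse_projects(ROWS)
for code, cat in collapsed.items():
    print(code, cat["name"], cat["projectRefs"])
```

Four project-specific rows become two functional categories, and the `oldCodes` list doubles as the raw material for an old-to-new mapping.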

#### Problem C: I1 (Investīciju programmu vadība) Duplicates

Similarly, I1 has **16+ programme-level groups** (I1-1 through I1-16), each with largely identical sub-structures for different EU operational programmes. The programmes differ in retention dates and responsible departments, but these are metadata, not structure.

- Current: **327 entries** in I1
- Proposed: **~40-50 entries** (function-based) + programme as metadata

#### Problem D: Category Count vs. Clerk Cognitive Capacity

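With **~496 leaf categories** (plus ~150 structural grouping rows), a clerk creating a new document faces an impossible cognitive task. Cognitive research offers a useful rule of thumb: Miller (1956) found that people hold roughly 7±2 items in working memory, and Rosch (1978) argued that people categorize most readily among a small number of basic-level categories. A clerk can therefore choose reliably only among a handful of options at each level. VARAM's schema has: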
- 3 domain categories (P, I1, I2): **good**
- 9-33 L1 categories per domain: **borderline to unmanageable**
- Up to 13+ L2 per L1: **too many, especially without descriptions**

The result is predictable: clerks default to a handful of "safe" categories, misclassify documents, or spend excessive time navigating the hierarchy — defeating the purpose of classification.
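
A fan-out check makes this measurable. The sketch below assumes codes are already hyphen-normalized and that a parent is obtained by dropping the last segment; the limit of 9 reflects the 7±2 rule of thumb, and the sample codes are illustrative.

```python
from collections import Counter

def parent_of(code):
    """Return the parent code by dropping the last hyphen-separated segment."""
    head, sep, _ = code.rpartition("-")
    return head if sep else None  # domain codes such as "P" have no parent

def overloaded_nodes(codes, limit=9):
    """Count direct children per parent and report parents above the limit."""
    fan_out = Counter(parent_of(c) for c in codes if parent_of(c) is not None)
    return {parent: n for parent, n in fan_out.items() if n > limit}

# Illustrative codes only: 33 project groups under I2, three entries under P-1.
codes = [f"I2-{i}" for i in range(1, 34)] + ["P-1", "P-1-1", "P-1-2", "P-1-3"]
print(overloaded_nodes(codes))  # {'I2': 33}
```

Running the same check on the full code list would give an objective list of nodes to restructure first.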

### 3.2 Structural Assessment Summary

| Metric | Current | Proposed (after normalization) |
|--------|---------|-------------------------------|
| Total entries | 647 | ~120-150 |
| Leaf categories (clerk-facing) | ~496 | ~80-100 |
| I2 entries | 166 | ~10-15 |
| I1 entries | 327 | ~40-50 |
| Max L1 categories per domain | 33 | ≤10 |
| Project-specific categories | ~200+ | 0 (metadata) |
| Duplicate structural patterns | ~30 identical sub-trees | 0 |

### 3.3 What Should Be Structure vs. What Should Be Metadata

| Currently a Category Level | Should Be | Reason |
|---------------------------|-----------|--------|
| Specific EU project name | **Metadata tag** | Project is an instance, not a function |
| Specific EU programme | **Metadata tag** | Programme is a funding context |
| Audience of correspondence | **Metadata enum** | Audience doesn't change the document type |
| Security classification | **Metadata field** | Orthogonal to document function |
| EU Commission flag on retention | **Metadata boolean** | Compliance attribute, not structure |
| Department assignment | **Metadata reference** | Departments change; functions don't |

---

## 4. Data Quality Issues (Spreadsheet-Level)

### 4.1 Issues Summary

| # | Issue | Severity | Scope |
|---|-------|----------|-------|
| 1 | Mixed code separators (hyphens and dots) | CRITICAL | 221/647 codes (34%) |
| 2 | NBSP (`\xa0`) characters that make retention cells look empty | HIGH | 93 rows |
| 3 | 50+ retention term format variants | HIGH | 496 retention values |
| 4 | Zero descriptions in Description column | HIGH | 100% of rows |
| 5 | Multi-department free-text assignments | MEDIUM | 67 rows |
| 6 | Level data in wrong columns | MEDIUM | 64 rows |
| 7 | Typo: `Il-9-2` instead of `I1-9-2` | LOW | 1 row |
| 8 | Trailing dot in code `I1-13-1.1.` | LOW | 1 row |
| 9 | Typo: "Patstāvīgi" instead of "Pastāvīgi" | LOW | 4 rows |

*(Full technical detail in previous assessment — omitted for brevity)*
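
Issues 1, 2, 7 and 8 are mechanical and can be handled by a small normalizer. The sketch below is one possible approach, assuming the hyphen is the canonical separator (as in the proposed tree) and that a lowercase `l` after a leading `I` is always a mistyped `1`; the function names are hypothetical.

```python
import re

def normalize_code(raw):
    """Normalize a classification code to hyphen-separated segments."""
    code = raw.replace("\xa0", " ").strip()
    code = code.replace(".", "-")               # unify separators (hyphen assumed canonical)
    code = re.sub(r"-+", "-", code).strip("-")  # collapse runs, drop a trailing separator
    # Assumption: "Il" at the start of a code is a mistyped "I1".
    code = re.sub(r"^Il(?=-)", "I1", code)
    return code

def is_blank(cell):
    """Treat cells that contain only whitespace or NBSP as empty."""
    return cell is None or cell.replace("\xa0", " ").strip() == ""

print(normalize_code("I1-13-1.1."))  # I1-13-1-1
print(normalize_code("Il-9-2"))      # I1-9-2
print(is_blank("\xa0"))              # True
```

Each automatic correction should still be logged during the one-time import so that the old-to-new mapping stays auditable.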

### 4.2 Retention Term Normalization

50+ free-text variants need consolidation into 5 structured patterns (element and attribute names as in the §5.2 XSD):

| Type | Example Input | Structured Output |
|------|--------------|-------------------|
| Permanent | "Pastāvīgi", "Patstāvīgi" | `<permanent>true</permanent>` |
| Duration | "5 gadi", "75 gadi" | `<duration years="5"/>` |
| Duration + trigger | "5 gadi pēc projekta noslēguma..." | `<duration years="5" triggerEvent="project_closure"/>` |
| Fixed date | "31.12.2034.", "2031-12-31 00:00:00" | `<fixedDate>2034-12-31</fixedDate>` |
| EU flagged | "31.12.2032.    EK" | `<retention euCommission="true"><fixedDate>2032-12-31</fixedDate></retention>` |
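
The consolidation can be prototyped with a handful of regular expressions. This is a sketch under stated assumptions, not the final import logic: it covers only the input shapes shown in the table, the mapping from "pēc projekta noslēguma" to `project_closure` is the single trigger known from the examples, and anything unrecognized is marked `unparsed` for manual review.

```python
import re

# Spelling variants listed in the issues table; the trigger mapping is an assumption.
PERMANENT_WORDS = {"pastāvīgi", "patstāvīgi"}
TRIGGER_PHRASES = {"pēc projekta noslēguma": "project_closure"}

def parse_retention(raw):
    """Map one free-text retention term to a structured dict."""
    text = " ".join(raw.replace("\xa0", " ").split())  # collapse whitespace and NBSP
    result = {"originalText": raw, "euCommission": False}
    if re.search(r"\bEK$", text):  # a trailing "EK" marks the EU Commission flag
        result["euCommission"] = True
        text = re.sub(r"\s*\bEK$", "", text)
    if text.lower() in PERMANENT_WORDS:
        result["type"] = "permanent"
        return result
    m = re.fullmatch(r"(\d{1,2})\.(\d{1,2})\.(\d{4})\.?", text)  # 31.12.2034.
    if m:
        day, month, year = (int(g) for g in m.groups())
        result.update(type="fixedDate", date=f"{year:04d}-{month:02d}-{day:02d}")
        return result
    m = re.fullmatch(r"(\d{4}-\d{2}-\d{2})(?: \d{2}:\d{2}:\d{2})?", text)  # Excel datetime
    if m:
        result.update(type="fixedDate", date=m.group(1))
        return result
    m = re.match(r"(\d+)\s*gad\w*\s*(.*)", text)  # "5 gadi", "75 gadi ..."
    if m:
        result.update(type="duration", years=int(m.group(1)))
        rest = m.group(2).lower()
        for phrase, event in TRIGGER_PHRASES.items():
            if phrase in rest:
                result["triggerEvent"] = event
        return result
    result["type"] = "unparsed"  # leave for manual review
    return result

print(parse_retention("31.12.2032.    EK"))
```

Keeping `originalText` on every result means no wording is lost even when a trigger phrase is not yet in the mapping.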

---

## 5. Proposed Architecture

### 5.1 Design Principles

1. **VDVC namespace** — `urn:vdvc:classification:2026` — reusable across government
2. **Function-first classification** — per MK Nr. 282 §33, classify by what the organization does, not by what regulation triggered the document
3. **Metadata-rich, structure-lean** — project, programme, audience as tags, not tree levels
4. **No Excel** — custom web GUI that edits backend XML directly; prevents spreadsheet drift
5. **Git-versioned SSOT** — XML on ProcessGit with full audit trail
6. **MCP-served** — machine-readable API for DVS integration and AI-assisted classification

### 5.2 XSD Schema (VDVC Domain)

```xml
<?xml version="1.0" encoding="UTF-8"?>
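<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="urn:vdvc:classification:2026"
           xmlns:vdvc="urn:vdvc:classification:2026"
           elementFormDefault="qualified">

  <!-- Excerpt: MetadataType, DomainListType, DeptListType, ProgrammeListType,
       ProjectListType and RetTermListType are referenced below but not shown;
       they must be defined (or included) before the schema validates. -->
  <xs:element name="classificationSchema" type="vdvc:SchemaType"/>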

  <xs:complexType name="SchemaType">
    <xs:sequence>
      <xs:element name="metadata" type="vdvc:MetadataType"/>
      <xs:element name="vocabularies" type="vdvc:VocabulariesType"/>
      <xs:element name="domains" type="vdvc:DomainListType"/>
    </xs:sequence>
    <xs:attribute name="version" type="xs:string" use="required"/>
    <xs:attribute name="effectiveDate" type="xs:date" use="required"/>
    <xs:attribute name="institution" type="xs:string" use="required"/>
  </xs:complexType>

  <!-- Controlled vocabularies (departments, programmes, projects) -->
  <xs:complexType name="VocabulariesType">
    <xs:sequence>
      <xs:element name="departments" type="vdvc:DeptListType"/>
      <xs:element name="programmes" type="vdvc:ProgrammeListType" minOccurs="0"/>
      <xs:element name="projects" type="vdvc:ProjectListType" minOccurs="0"/>
      <xs:element name="retentionTerms" type="vdvc:RetTermListType"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Category node (recursive, function-based) -->
  <xs:complexType name="CategoryType">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="description" type="xs:string" minOccurs="0"/>
      <xs:element name="legalBasis" type="xs:string" minOccurs="0"
                  maxOccurs="unbounded"/>
      <xs:element name="retention" type="vdvc:RetentionType" minOccurs="0"/>
      <xs:element name="responsibleUnits" minOccurs="0">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="unitRef" type="xs:string" maxOccurs="unbounded"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element name="mediaType" type="vdvc:MediaTypeEnum" minOccurs="0"/>
      <xs:element name="system" type="xs:string" minOccurs="0"/>
      <xs:element name="applicableContexts" minOccurs="0">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="programmeRef" type="xs:string"
                        minOccurs="0" maxOccurs="unbounded"/>
            <xs:element name="projectRef" type="xs:string"
                        minOccurs="0" maxOccurs="unbounded"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element name="subcategory" type="vdvc:CategoryType"
                  minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="code" type="xs:string" use="required"/>
    <xs:attribute name="level" type="xs:positiveInteger" use="required"/>
    <xs:attribute name="status" type="vdvc:StatusEnum" default="active"/>
  </xs:complexType>

  <!-- Retention: structured, not free-text -->
  <xs:complexType name="RetentionType">
    <xs:choice>
      <xs:element name="permanent" type="xs:boolean"/>
      <xs:element name="duration">
        <xs:complexType>
          <xs:attribute name="years" type="xs:positiveInteger" use="required"/>
          <xs:attribute name="triggerEvent" type="xs:string"/>
        </xs:complexType>
      </xs:element>
      <xs:element name="fixedDate" type="xs:date"/>
    </xs:choice>
    <xs:attribute name="euCommission" type="xs:boolean" default="false"/>
    <xs:attribute name="legalReference" type="xs:string"/>
    <xs:attribute name="originalText" type="xs:string"/>
  </xs:complexType>

  <!-- Enumerations -->
  <xs:simpleType name="MediaTypeEnum">
    <xs:restriction base="xs:string">
      <xs:enumeration value="electronic"/>
      <xs:enumeration value="paper"/>
      <xs:enumeration value="hybrid"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="StatusEnum">
    <xs:restriction base="xs:string">
      <xs:enumeration value="active"/>
      <xs:enumeration value="deprecated"/>
      <xs:enumeration value="draft"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

**Key design decisions:**
- `legalBasis` is a **metadata field on categories**, not a structural level: it records which normative acts require the category without creating separate tree branches
- `applicableContexts` with `programmeRef` / `projectRef` replaces the 33 duplicate I2 sub-trees: a single "Korespondence" category can be tagged with all applicable projects
- `status` enables deprecation without deletion (audit trail)
- `RetentionType` with `legalReference` links retention to its legal source
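
To show how these decisions look in an instance document, the sketch below builds one `subcategory` element with Python's standard `xml.etree.ElementTree`. It assumes elements are namespace-qualified, emits children in the order of the `CategoryType` sequence, and uses a hypothetical helper name (`build_category`) and an illustrative 5-year retention.

```python
import xml.etree.ElementTree as ET

NS = "urn:vdvc:classification:2026"
ET.register_namespace("vdvc", NS)

def q(tag):
    """Qualify a tag name with the VDVC namespace."""
    return f"{{{NS}}}{tag}"

def build_category(code, level, name, project_refs=(), legal_basis=(), years=None):
    """Build one category element; child order follows the CategoryType sequence."""
    cat = ET.Element(q("subcategory"), {"code": code, "level": str(level)})
    ET.SubElement(cat, q("name")).text = name
    for ref in legal_basis:  # legal basis is metadata, not a tree level
        ET.SubElement(cat, q("legalBasis")).text = ref
    if years is not None:
        retention = ET.SubElement(cat, q("retention"))
        ET.SubElement(retention, q("duration"), {"years": str(years)})
    if project_refs:  # projects are tags on one functional category
        contexts = ET.SubElement(cat, q("applicableContexts"))
        for ref in project_refs:
            ET.SubElement(contexts, q("projectRef")).text = ref
    return cat

cat = build_category(
    "I2-1", 1, "Korespondence",
    project_refs=["2.2.1.1/17/I/012", "2.2.1.1/19/I/004"],
    years=5,  # illustrative value, not taken from the schema
)
print(ET.tostring(cat, encoding="unicode"))
```

One element now carries what previously required a separate sub-tree per project.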

### 5.3 Proposed Simplified Classification Tree

```
VARAM Classification Schema (VDVC:2026)

P — Pārvalde (Administration)
├── P-1  Iestādes vadība (Institutional Management)
│   ├── P-1-1  Normatīvie dokumenti (Regulatory documents)
│   ├── P-1-2  Rīkojumi (Orders)
│   ├── P-1-3  Sanāksmes un protokoli (Meetings & protocols)
│   ├── P-1-4  Plānošana un pārskati (Planning & reporting)
│   ├── P-1-5  Sarakste (Correspondence)
│   │         → metadata: audience, securityLevel
│   ├── P-1-6  Pilnvaras un lēmumi (Authorizations & decisions)
│   └── P-1-7  Drošība un trauksme (Security & whistleblowing)
├── P-2  Budžets (Budget Planning)
├── P-3  Personālvadība (HR Management)
│   ├── P-3-1  Darba līgumi (Employment contracts)
│   ├── P-3-2  Personāla lietas (Personnel files)
│   ├── P-3-3  Apmācības (Training)
│   └── P-3-4  Novērtēšana (Performance evaluation)
├── P-4  Saimnieciskie jautājumi (Facilities)
├── P-5  Iepirkumi (Procurement)
├── P-6  Juridiskā funkcija (Legal)
├── P-7  Komunikācija (Communications)
├── P-8  Audits (Audit)
└── P-9  Finanšu vadība (Financial Management)

I1 — Investīciju programmu vadība (Programme Management)
├── I1-1  Programmu plānošana (Programme planning)
├── I1-2  Uzraudzība un kontrole (Monitoring & control)
├── I1-3  Finanšu pārvaldība (Financial management)
├── I1-4  Ziņojumi un pārskati (Reports)
├── I1-5  Maksājumi un pārbaudes (Payments & verification)
└── I1-6  Sarakste un lēmumi (Correspondence & decisions)
          → metadata: programmeRef = [ERAF, ANM, ESF, ...]

I2 — Investīciju projektu ieviešana (Project Implementation)
├── I2-1  Korespondence (Correspondence)
├── I2-2  Līgumi un grozījumi (Contracts & amendments)
├── I2-3  Rīkojumi un protokoli (Orders & protocols)
├── I2-4  Komunikācija (Communications materials)
├── I2-5  Finanšu dokumentācija (Financial documentation)
└── I2-6  Noslēguma dokumenti (Closure documents)
          → metadata: projectRef = [project-001, project-002, ...]
```

**From 647 entries → ~80-100 functional categories** + rich metadata vocabularies.

---

## 6. Custom Web GUI (Not Excel)

### 6.1 Why Not Excel

| Problem with Excel | Impact |
|-------------------|--------|
| People edit the Excel directly, bypassing validation | Reintroduces data quality issues |
| Cannot enforce controlled vocabularies | Free-text retention terms return |
| Cannot represent metadata (project/programme tags) on categories | Structural duplication returns |
| No validation against XSD schema | Invalid data enters the system |
| No version control / audit trail | Changes are invisible |
| Cannot embed business logic (retention calculation, department lookup) | Manual errors |
| Multiple people can have different versions | No SSOT guarantee |

### 6.2 GUI Architecture

```
┌─────────────────────────────────────────────┐
│           VDVC Classification Editor         │
│  ┌───────────────────────────────────────┐  │
│  │  Tree Navigator (collapsible)          │  │
│  │  ├── P — Pārvalde                      │  │
│  │  │   ├── P-1 Iestādes vadība          │  │
│  │  │   │   ├── P-1-1 Normatīvie dok.    │  │
│  │  │   │   └── P-1-2 Rīkojumi ←[EDIT]  │  │
│  │  └── I1 — Investīciju programmas       │  │
│  └───────────────────────────────────────┘  │
│  ┌───────────────────────────────────────┐  │
│  │  Category Detail Panel                 │  │
│  │  Code: [P-1-2]     Status: [Active ▼] │  │
│  │  Name: [Rīkojumi un to pielikumi...]   │  │
│  │  Description: [Ministru rīkojumi...]   │  │
│  │  ─── Retention ───                     │  │
│  │  Type: [Permanent ▼]                   │  │
│  │  Legal ref: [MK Nr. 282 §31]          │  │
│  │  ─── Responsibility ───                │  │
│  │  Departments: [LN ×] [KD ×] [+ Add]   │  │
│  │  ─── Context ───                       │  │
│  │  Programmes: [all]                     │  │
│  │  Media: [Electronic ▼]                 │  │
│  │  System: [DVS Namejs ▼]               │  │
│  │  ─── Legal Basis ───                   │  │
│  │  [+ Add normative reference]           │  │
│  └───────────────────────────────────────┘  │
│  [Save] [Validate] [Preview XML] [History]  │
└─────────────────────────────────────────────┘
```

### 6.3 GUI Features

| Feature | Purpose |
|---------|---------|
| **Tree navigation** | Hierarchical browse/search with drag-drop reordering |
| **Controlled vocabulary dropdowns** | Departments, retention types, media types — no free-text |
| **Inline XSD validation** | Real-time validation as users edit; cannot save invalid data |
| **Retention calculator** | Input retention rule → system shows calculated expiry per document date |
| **Department lookup** | Autocomplete from VDVC organization registry (ProcessGit VARAM MCP) |
| **Diff / history view** | Git-backed change tracking with who-changed-what |
| **Bulk import** | One-time import from current Excel, then Excel is retired |
| **Export views** | Generate read-only Excel, HTML, PDF for stakeholders |
| **Legal basis linker** | Reference normative acts by Latvijas Vēstnesis number |
| **Multi-user with roles** | Lietvedis (records clerk, view-only), department editor, schema admin |
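
The retention calculator is the simplest of these features to prototype. The sketch below assumes a structured rule in the form of a dict with a `type` key (`permanent`, `fixedDate` or `duration`) and assumes that a duration runs from the date passed in; the actual counting rule (for example, from a trigger event or from the start of the following year) would need to be confirmed against the regulation.

```python
from datetime import date

def retention_expiry(rule, start_date):
    """Return the expiry date for a structured retention rule, or None if permanent."""
    kind = rule["type"]
    if kind == "permanent":
        return None
    if kind == "fixedDate":
        return date.fromisoformat(rule["date"])
    if kind == "duration":
        # Assumption: the term is counted in whole years from start_date.
        target_year = start_date.year + rule["years"]
        try:
            return start_date.replace(year=target_year)
        except ValueError:  # 29 February with a non-leap target year
            return start_date.replace(year=target_year, day=28)
    raise ValueError(f"unknown retention type: {kind}")

print(retention_expiry({"type": "duration", "years": 5}, date(2026, 2, 15)))  # 2031-02-15
```

For rules with a trigger event, the caller would pass the trigger date (such as the project closure date) as `start_date`.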

### 6.4 Technology Stack

```
Frontend:  React + Tailwind (ProcessGit-integrated SPA)
Backend:   ProcessGit API + MCP server
Storage:   Git repository (XML + XSD)
Validation: Client-side XSD validation + server-side on commit
Auth:      ProcessGit OAuth / VARAM SSO
Deploy:    processgit.org/VARAM/Document_classification_schema/
```

---

## 7. ProcessGit Repository Structure

```
VARAM/Document_classification_schema/
├── README.md
├── schema/
│   ├── vdvc-classification-2026.xsd       ← Schema definition
│   └── vdvc-vocabularies.xsd              ← Shared controlled vocabularies
├── data/
│   ├── varam-classification-2026.xml      ← Canonical SSOT
│   ├── vocabularies/
│   │   ├── departments.xml                ← Cross-ref with VARAM org registry
│   │   ├── programmes.xml                 ← EU programme registry
│   │   └── projects.xml                   ← Active project registry
│   └── archive/
│       └── original-excel-2026.xlsx       ← Original for audit trail
├── gui/
│   ├── index.html                         ← Classification editor SPA
│   ├── src/                               ← React components
│   └── package.json
├── render/
│   ├── classification.xslt               ← Human-readable transform
│   └── classification.html               ← Auto-generated view
├── mcp/
│   └── server-config.yaml                ← MCP server endpoint
├── tools/
│   ├── import-excel.py                   ← One-time Excel import
│   ├── export-excel.py                   ← Read-only Excel generation
│   ├── validate.py                       ← XSD validation
│   └── retention-calculator.py           ← Retention date computation
└── docs/
    ├── migration-mapping.md              ← Old code → new code mapping
    └── normative-basis.md                ← Legal references
```

---

## 8. MCP Server Integration

Extend the existing ProcessGit MCP pattern (already live for VARAM Organizations Register):

| MCP Tool | Input | Output |
|----------|-------|--------|
| `vdvc:search` | Full-text query in LV/EN | Matching categories with context |
| `vdvc:get_category` | Category code | Full details + metadata |
| `vdvc:list_categories` | Filters: domain, level, dept, programme | Filtered list |
| `vdvc:suggest_category` | Document title + body text | Top 3-5 category suggestions with confidence |
| `vdvc:validate_code` | Category code | Validity check + active status |
| `vdvc:calculate_retention` | Category code + document date | Retention expiry date |
| `vdvc:describe_model` | — | Schema structure, vocabularies, stats |

The `suggest_category` tool is the **key efficiency enabler**: instead of a clerk navigating ~100 categories, the AI reads the document and recommends the best matches.
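
The simpler lookup tools reduce to dictionary operations over the parsed schema. The sketch below shows plain Python functions that an MCP server could wrap as `vdvc:validate_code` and `vdvc:list_categories`; it does not use any MCP SDK, and the in-memory `CATEGORIES` data (including the `deprecated` status) is hypothetical.

```python
# Hypothetical in-memory view of the schema; real data would be loaded from the XML.
CATEGORIES = {
    "P-1-2": {"name": "Rīkojumi", "domain": "P", "status": "active"},
    "I2-1": {"name": "Korespondence", "domain": "I2", "status": "active"},
    "P-1-13-5": {
        "name": "Sarakste ar valsts pārvaldes iestādēm DIENESTA VAJADZĪBĀM",
        "domain": "P",
        "status": "deprecated",
    },
}

def validate_code(code):
    """Report whether a code exists and whether it is still active."""
    cat = CATEGORIES.get(code)
    active = cat is not None and cat["status"] == "active"
    return {"code": code, "valid": cat is not None, "active": active}

def list_categories(domain=None, status=None):
    """Return category codes matching the optional filters, sorted as strings."""
    return sorted(
        code
        for code, cat in CATEGORIES.items()
        if (domain is None or cat["domain"] == domain)
        and (status is None or cat["status"] == status)
    )

print(validate_code("P-1-13-5"))
print(list_categories(domain="P", status="active"))
```

Returning `valid` and `active` separately lets the DVS accept documents already filed under a deprecated code while blocking new filings.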

---

## 9. Roadmap

| Phase | Duration | Deliverables |
|-------|----------|-------------|
| **Phase 1**: Assessment approval & schema design | 1 week | Approved XSD, normalization rules, migration mapping |
| **Phase 2**: Data cleaning & functional restructure | 2-3 weeks | Normalized XML with ~100 categories; old→new code mapping |
| **Phase 3**: GUI development | 3-4 weeks | React SPA on ProcessGit; tree editor, validation, export |
| **Phase 4**: ProcessGit deployment & MCP server | 1-2 weeks | Live repo, MCP endpoint, vocabularies |
| **Phase 5**: AI description generation | 1-2 weeks | AI-drafted Latvian descriptions for all categories |
| **Phase 6**: DVS "Namejs" integration | 2-3 weeks | Classification import adapter, clerk-facing AI assist |

**Total: 10-15 weeks**

---

## 10. Risk Assessment

| Risk | Impact | Mitigation |
|------|--------|------------|
| LNA (Latvijas Nacionālais arhīvs) rejects restructured schema | HIGH | Maintain old↔new code mapping; preserve all retention terms with `originalText`; engage LNA early |
| Records clerks (lietveži) resist moving away from Excel | MEDIUM | GUI provides Excel-like table view; generate read-only Excel exports on demand |
| Normative acts explicitly reference old codes | MEDIUM | Deprecate rather than delete; old codes resolve to new via alias table |
| Project-as-metadata breaks DVS "Namejs" import format | MEDIUM | Provide flat-file export that expands metadata back to rows for legacy DVS |
| Functional restructure conflicts with department ownership | MEDIUM | Map departments to functions, not to categories; allow multi-department tags |
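
Two of these mitigations, the alias table and the flat-file export, are small enough to sketch. The example below is illustrative only: the old codes in `ALIASES` and the row layout of the legacy export are assumptions, since the actual DVS "Namejs" import format is not described here.

```python
# Hypothetical alias table: old project-specific code -> (new code, project number).
ALIASES = {
    "I2-1-1": ("I2-1", "2.2.1.1/17/I/012"),
    "I2-2-1": ("I2-1", "2.2.1.1/19/I/004"),
}

def resolve(old_code):
    """Resolve a legacy code to the new category plus its project metadata."""
    new_code, project_id = ALIASES[old_code]
    return {"code": new_code, "projectRef": project_id, "replaces": old_code}

def expand_for_legacy(category):
    """Expand one metadata-tagged category into one flat row per project."""
    return [
        (category["code"], category["name"], ref)
        for ref in category["projectRefs"]
    ]

category = {
    "code": "I2-1",
    "name": "Korespondence",
    "projectRefs": ["2.2.1.1/17/I/012", "2.2.1.1/19/I/004"],
}
print(resolve("I2-2-1"))
for row in expand_for_legacy(category):
    print(row)
```

Because both directions are derived from the same data, the old and new views cannot drift apart.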

---

## 11. Conclusion

The current classification schema is complicated not because document management is inherently complex, but because **normative document origins have been used as structural taxonomy levels** instead of metadata. Every new EU project, every new regulatory delegation, creates a new branch in the tree rather than a new tag on an existing functional category.

The proposed approach:
1. **Restructures** the tree from 647 entries to ~100 functional categories
2. **Enriches** each category with metadata (project, programme, legal basis, audience)
3. **Replaces Excel** with a validated, Git-backed web GUI
4. **Serves** the schema via MCP for AI-assisted classification
5. **Complies** with MK Nr. 282 §33 function-based classification requirements

The VDVC namespace ensures this approach can be replicated across government institutions, not just VARAM.