
Classifying Enterprise Data for Retention

By: Gary Link 

Executive Summary

You are the new Records & Information Manager for a large organization that employs many computer applications that perform its work and maintain its business data. The organization maintains a catalog of computer systems numbering in the hundreds, many of which have been running for years. One thing the organization has never done is, of course, apply retention policy to the data in those computer applications. That is now your job. Where do you start?

You will need an approach for classifying data in these enterprise computer applications that employs the principles and practices of Records & Information Management (“RIM”) in combination with other professional disciplines, namely Data Management, Information Architecture, and Business Process Analysis. And you will need collaboration, lots of collaboration, with your business partners in the departments housing those disciplines.

Your task, then, is an Information Governance (“IG”) effort, and is best accomplished within an IG Framework. However, because IG and IG Frameworks are covered at great length in other resources, this article foregoes a general discussion of IG and focuses instead on the working level of accomplishing your task. Introductions to the other disciplines, however, are provided.

Part 1 of this article gives introductory information on Information Architecture, Data Management, and, to a lesser extent, Business Process Analysis. Each of these will start off with a high-level definition and description of the field, followed by the select parts (mostly key terms) of that field that you will need to complete your task. Note that a full discussion of these fields and their entire body of principles, practices, and key terms is not in the scope of this article.

In Part 2 you will combine the three disciplines discussed in Part 1 to develop criteria and rules to build a methodology to address classification in the myriad types of applications you will encounter. 

In Part 3 you will execute on your project. Part 3 assumes you have an enterprise IG Framework in place, or at least a project-level framework (which you can later parlay into a permanent one) in which you’ll execute your classification.

 

Part ONE: Your Business Partner Disciplines

Information Architecture

Information Architecture is defined as “the overall design of a computing system and the logical and physical interrelationships between its components. The architecture specifies the hardware, software, access methods and protocols used throughout the system.”[i] An organization’s overall infrastructure of IT systems, platforms, and applications is referred to as Enterprise Architecture, while Solution Architecture and Technology Architecture are employed at the individual computer application level. Understanding architecture concepts and terms at both the enterprise and application levels will help you to provide the necessary context for data classification.

 

Your organization will likely have an inventory of its IT assets:

Configuration Management Database (“CMDB”) – this is your organization’s catalog of IT assets. The CMDB tracks things like software type, ownership, architecture, recovery time objectives, and install status. 

Configuration Items (“CIs”) – This is a general term for the items listed in the CMDB. CIs can include software, hardware, network, and storage. You will be working with software.

Your CMDB will likely have a category for types of software, including, but not limited to:

  1. Application - Standalone programs that perform a business function and are used by or communicate with end users. 
  2. Tool - Software that performs certain functions, like calculations, but is not part of the end users’ business functions.
  3. Platform - Infrastructure on which software is executed. Applications and tools may reside on platforms.

These classifications will be important to you because, of the three, typically only Applications will be repositories for data.
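As a sketch, that scoping rule can be expressed as a simple filter over a CMDB export. The field names and CI names here are illustrative assumptions, not a real CMDB schema:

```python
# Illustrative sketch: narrowing a CMDB export to the CIs that can hold
# business data. Field names ("name", "ci_type") are assumptions for the
# example, not a real CMDB schema.

cmdb_export = [
    {"name": "LoanOrigination", "ci_type": "Application"},
    {"name": "CurrencyCalc",    "ci_type": "Tool"},
    {"name": "MidrangeLinux",   "ci_type": "Platform"},
    {"name": "DepositSystem",   "ci_type": "Application"},
]

# Of the three software types, typically only Applications are data
# repositories, so only they enter the classification scope.
in_scope = [ci["name"] for ci in cmdb_export if ci["ci_type"] == "Application"]
print(in_scope)  # -> ['LoanOrigination', 'DepositSystem']
```

In practice the export would come from your ITSM platform, but the scoping logic is the same: Tools and Platforms drop out before classification begins.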

 

For each application, the CMDB will have two sets of stakeholders with whom you’ll need to work: the business owners and the technology team. 

  1. Application Owner – Has the business and budget responsibility for the application and will know the purpose for which it was built or bought.
  2. Custodians – Report to the Application Owner and operate the application from a business function standpoint.
  3. Technology Team – IT professionals who run the machine. The CMDB will provide the names of the colleagues serving in the various technology positions. One of these roles, possibly titled “Application System Manager” or “Responsible Technology Manager,” will be the point person for your effort. 

     

The CMDB may also have a category for the location in which the application operates: 

  1. On Prem – within the internal operation environments and control of the organization, “inside the firewall.”
  2. Vendor Hosted – Colleagues access an online system owned and operated by a vendor that performs a specific function.
  3. Cloud – Storage, computing, and related IT-enabled capabilities are delivered from a vendor as a service using internet technologies.

 

The type of location will play a large role in how you implement the retention classification you have selected for the application. Data retention in vendor-hosted applications likely will need to be managed via the agreement with the vendor, depending on which party is managing the data on the system. Likewise with the cloud vendor.

Leaving the CMDB, application data will typically exist in two environments:

  1. Production Environment – Environment where the application code runs live and the application’s services are fully accessible to users. 
  2. Non-Production Environment, or “the Lower Environments” – Pre-production environments for development, testing, and quality assurance, often designated DEV, TEST, and QA.

Knowledge of which of these environments the data resides in will aid you in classifying the data. In retention classification, data in lower environments will typically default to non-records, except when they are collected from the environment to document the IT project, or some related function, in the form of artifacts. In these cases, the record data are removed from the lower environments and placed into a filing system or system of record documenting the function being fulfilled.

 

Data Management

As a discipline, Data Management is the convergence of data quality, data policies, business process management, and risk management surrounding the handling of data. In an organization, it includes the decision-making framework and authority for all data-related matters. Its activities can include data collection, processing, storage, authentication, certification, security & privacy, governance, and lifecycle management.

Data Management Key Terms on the Data Level:

  1. Data - Information that has been translated into binary form, usually residing in an application and distinguished from “control information” in the application.
  2. Record - A complete set of information, composed of fields, each of which contains one item of information.
  3. Data Element - Atomic unit of data having precise meaning or semantics; has a name, definition, representation terms, and metadata. The smallest piece of information considered meaningful and usable. In a record, field names can include “Name,” “Address,” “Phone Number.”
  4. Data Set – Contrasted with a Data Element, a Data Set is a complete collection of data elements into a form as simple as a data table or as complex as a computer application.
  5. Structured Data – Data that conforms to a specific, pre-defined schema or data model, or is tagged or otherwise arranged into database tables (rows and columns). Examples include data in relational databases, data in graph databases, call data records, financial transactions, and system audit logs.
  6. Unstructured Data – Content that does not conform to a specific, pre-defined data model, or is not tagged or otherwise structured into database tables (rows and columns). Examples include documents, presentations, graphics, images, text, reports, videos, or sound recordings.
  7. Semi-Structured Data – This is unstructured data maintained in a structured storage system. 
  8. Backups – Data backups are duplicate data and therefore Non-Records. Backups are a business continuity measure and are typically retained per the backup rotation – the subsequent backup overwriting the previous.

Data Management Key Terms on the Application Level are defined by their place in the data stream:

  1. Sourcing Mechanisms/Origination Systems - Provide data to the Systems of Record.
  2. Systems of Record - Definitive data store for a given element or piece of information.
  3. System of Reference - Designated consuming system that is the preferred location for consumption.
  4. Consuming System - Downstream application from the system of record or a system of reference that uses the upstream system’s data for a specific purpose.

     

Within an application data may be classified into these groups:

  1. Production Data - Data in an electronic medium which is either transactional (used for day-to-day business processing) or configuration information used to drive the functioning of the application. This data is held within an application or associated database/file structure and limited to those systems that are defined as the production system of record. It does not include error or process logs written by the system. It is typically the business data supporting the function for which the application was built or bought.
  2. System Data – Error or process logs and performance data written by the system. If being monitored, captured, and analyzed for reporting, this data would be classified for retention independently of the Production Data. Problems and incidents are typically addressed and documented outside of the application, in problem and incident management systems.
  3. Reference Data – Transactional applications may have look-up tables that hold the account information of the customers conducting the transactions. This is duplicate data from the System of Record for the account, and is likely refreshed daily. Superseded versions/overwritten data do not have retention. However, if the transactional data is archived to another data store, the corresponding reference data will need to go into the archive solution as well.

     

Business Process Analysis 

With the name practically providing the full definition, Business Process Analysis (BPA) is typically considered to have five distinct sequential activities: process review, data collection, process analysis, improvement identification, and implementation of changes. You will not be much concerned with the last two activities: you only want to understand the business process that is creating the data, not to change that process.

Of the many BPA tools, process mapping (or visualization) is the one that will serve you in this project. Process mapping illustrates process flow using standard, pre-defined symbols that illustrate the parties involved, steps taken, and interactions between the parties. Dedicated process mapping software exists, and you may already have some at your disposal, but the basic symbols also exist in Word and PowerPoint.

You will likely only use your process mapping for long business processes that include many different parties creating a lot of different documents and data. Your purpose will be to identify all of the document and data repositories, and which have content that becomes part of the final product – Records – and which remain behind as drafts – Non-Records. The loan origination process comes to mind as a good example for such a process.

 

Part TWO: Combining the Disciplines for Classification

When starting your classification project, know that every application will be unique, each with its own purpose, architecture, and data. You will employ endless combinations of the three disciplines discussed in Part 1 to set criteria and rules that you apply to the applications you’ll encounter. Some applications will be standalone. Some will need to be put within the context of their place in a data management stream in order to be classified. Some will have to be put in the context of their place within a business process.

We cannot anticipate every type of application you will encounter, but we can identify some common types and make some categories.  

 

Standalone Applications

Some applications can be classified by themselves, without any dependency on their place within a data stream or within a business process.

Transactional Applications

Possibly the easiest type of application to classify may be the transactional application. This type simply executes and tracks an action. Examples would be bank consumer transactions (deposits, withdrawals) or point-of-sale applications. The record for the transaction (the Data Management record) will have a date field in which the transaction occurred. This will be the field that corresponds to the retention trigger, in this case “the date of the transaction.” These types will be more straightforward than other, more complex applications.
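A minimal sketch of a transaction-date retention trigger, assuming a hypothetical seven-year retention period (the field names and the period are illustrative, not from any particular retention schedule):

```python
from datetime import date, timedelta

# Sketch: using the transaction-date field as the retention trigger.
# The seven-year period is an illustrative assumption.
RETENTION_YEARS = 7

def is_past_retention(transaction_date: date, today: date) -> bool:
    """True when the trigger date plus the retention period has passed."""
    # "Plus 7 years" approximated as 7 * 365 days for the sketch.
    return transaction_date + timedelta(days=RETENTION_YEARS * 365) < today

today = date(2024, 1, 1)
assert is_past_retention(date(2015, 6, 30), today)      # eligible for deletion
assert not is_past_retention(date(2020, 6, 30), today)  # still in retention
```

In a real implementation this test would run as a query against the transaction table, but the logic is the same: one date field in the record drives the entire retention calculation.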

Again, some transactional applications may have a lookup table that references a customer’s account. As mentioned in Part 1, this is duplicate data from the System of Record for the account, and is likely refreshed daily. Superseded versions/overwritten data do not have retention.

 

Asset Management Applications

These applications are inventories of things like fixed assets or IT assets (like the CMDB) that track the asset from its acquisition through its retirement or decommission. The retention is typically “life of the asset plus [x] years,” so the date field in the asset record that will correspond with the retention trigger will be the “decommission date” field.

 

Customer Account Applications

In contrast with transactional applications, account applications set conditions, agreements, authorizations, and benefits that last the life of the account. The retention is typically “life of the account plus [x] years,” and the date field in the account record that will correspond with the retention trigger will be the “account terminated” field.

One complexity with this event-driven retention is that customers might not formally close the account. The organization may opt to set a rule that accounts with no activity after [x] years will be closed. For some accounts, like bank accounts, there may be customer property or money remaining in the account. The organization will have to follow its state’s escheat laws to disposition the customer’s money or property.
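The dormancy rule described above might be sketched as follows, assuming a hypothetical five-year inactivity threshold and illustrative field names (escheat handling is out of scope for the sketch):

```python
from datetime import date

# Sketch: flagging accounts with no activity for [x] years as candidates
# for administrative closure. The five-year threshold and field names are
# illustrative assumptions.
DORMANCY_YEARS = 5

accounts = [
    {"id": "A-100", "last_activity": date(2016, 3, 1),  "closed": None},
    {"id": "A-200", "last_activity": date(2023, 9, 15), "closed": None},
]

def dormant(acct: dict, today: date) -> bool:
    """True when an open account has had no activity for the dormancy period."""
    years_idle = (today - acct["last_activity"]).days / 365.25
    return acct["closed"] is None and years_idle >= DORMANCY_YEARS

today = date(2024, 1, 1)
to_close = [a["id"] for a in accounts if dormant(a, today)]
print(to_close)  # -> ['A-100']
```

Closing the account sets the “account terminated” date, which in turn starts the “life of the account plus [x] years” retention clock.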

Another complexity arises when a condition, agreement, authorization, or benefit within the account changes. What to do with the superseded condition? The organization may need to be able to prove which benefit was in effect at which point in time. This will depend to some extent on the internal architecture and how the records (data management records) are set up. The application may simply add a new record to the account, or it may use a relational history table for the account to which inactive conditions, agreements, authorizations, and benefits are sent.

 

Full-Service Account Management Applications

Some customer account applications do much more than maintain conditions, agreements, authorizations, and benefits. These provide document management features wherein the customer can submit documents and the organization has account managers dedicated to the customer who review the documents and may, or may not, ingest the documents into the customer’s account. Such an application may also have messaging between the customer and the account manager, and between account managers. 

For document management, typically there is a space provided for document uploads and reviews that is essentially an FTP (file transfer protocol) space. In this space the document’s existence should be temporary and have an auto-delete after a short period of time. Documents to become part of the customer’s account are typically ingested into a repository connected to the application.

For messages, the organization may opt for a retention equal to its retention for instant messages in its other, standalone messaging environments. A retention of “life of the account plus” is unlikely for messages. But that is up to the organization.

 

Models

Computer models use data, mathematical formulas, and computational algorithms to predict events and behaviors in the real world. Unlike other types of applications, a model’s code may need to have its retention set commensurate with the results it produces. This is because the organization may have need to show how it came up with its projections.

 

Web Applications

Because many organizations post prices, terms, and conditions on their websites, they may have need to keep those pages to prove what was advertised or in effect during any given time. The organization may opt to archive pages or retain the underlying data & code to enable them to reproduce the pages. 

 

Applications in Relation to their Place in the Data Stream

Originating Systems – 

As stated in Part 1, originating systems gather data or documents from outside sources (example: potential customers). 

In cases where an originating system is part of a business process, like an application for a loan or account, these documents and data might or might not transfer to a system of record. In a completed and approved business process, the data and documents will typically move or be copied to a system of record. Conversely, the applicant might be declined by the business, or the applicant may simply abandon the application process unfinished.

In this type of business process the originating system will have two types of data: unique data and duplicate data.

The data of the declined or withdrawn application will be the unique data.  Data from approved applicants are either moved or copied to the system of record. If copied, the copy remaining in the originating system will be duplicate data. Since in the RIM world we define duplicates as Non-Records, the duplicate data will not have retention beyond its operational need, which should be defined in the system documentation.

It is the unique data – the data of the declined or withdrawn applications – that will need a retention policy as a Record. This is the data your organization will need to prove abandonment by the applicant or to defend its rejection of the application.

 

Systems of Record –

The Production Data in a System of Record will always be a Record in the RIM sense and always need a retention policy, i.e., be mapped to a Record Series in your organization’s Records Retention Schedule. 

Systems of Reference & Consuming Systems – 

System of Reference is a Data Management term, and the “reference” focuses on Data Elements rather than the Data Set of the entire application. A system of reference is typically composed entirely of duplicate data elements. Often data elements in a System of Reference originate from many upstream applications but all flow into one new application, and together form a unique Data Set with an express purpose or function. So while from the Data Management view this type of System of Reference is a duplicate, from the RIM view the data in this application is a Record unto itself.

By contrast, a System of Reference whose entire Data Set is a complete duplicate of one contributing System of Record would be a duplicate, or Non-Record in your RIM project. An example would be an application whose sole purpose is to generate reports.

 

Applications in Relation to their Place in a Business Process

Here you will also have originating systems, systems of record, consuming systems, and tools. In this case, they will be defined not by their place in a Data Stream, but rather, in the Business Process.

In a business process, structured and unstructured data will be collected from various sources and put into one system. That system will either decision the data or use a different application as a decisioning engine. For approved data sets, the system will either convert the data into electronic documents or use an interim document management tool to do so. The documents will then be printed out and signed. The signed documents will then be converted back to electronic documents and filed into a repository. And from all of that, you will have to determine which systems maintain Records and which systems contain Non-Records.

Here is where your Business Process Analysis and Process Flow Diagrams will come into play. By mapping out the process, you will endeavor to determine the Record or Non-Record status of a system’s content after the process passes it by. Take, for example, the document management tool that generated the forms from the data fed to it. After the forms it created are printed out and signed, those signed documents are now the record, and the unsigned forms remaining in the document management tool most likely are now Non-Records. Likewise, the signed paper copies can themselves become superseded as the Record after they’ve been scanned and the digital images are uploaded to the business process’ repository – the system of record.

You can even develop your own business process diagram designed specifically to show this, and illustrate the flow in such a way as to show the systems that have Records and those that have Non-Records. 

 

Documenting Your Approach

Prior to beginning the execution project, you will want to convert this part (Part 2) into a set of criteria and rules and have them agreed upon by the project participants. Starting the project with a set of rules agreed upon by all stakeholders will greatly increase its chances of success. Also, having attained said agreement from all stakeholders, you will have cleared what is likely the largest hurdle in your project.

You should also set forth a process for classifying the applications. Although each application will be different, the process you use for each should be the same: identifying the Business Owners and Technology Managers, methods and schedules for engagement, how the final classification will be decided, what documentation will be generated, how progress will be tracked, and what repository will be created for the documentation.

A quick process outline might be:

 

  1. Initial Meeting between the Project Team and the Application Team to explain the project and purpose.
  2. Identification of the Production Data/Business Data in the application.
  3. Project Team takes the conversation back to their group.
  4. Follow Up meeting and identification of Record Series.
  5. Approval by Application Owner of Record Series.
  6. Addition of Record Series in the application’s entry in the CMDB.

All the while you are keeping track of each step for each application in your project tracker, and reporting progress weekly to the project stakeholders.

 

Part THREE: Executing Your Classification Project

A bonus discipline: Project Management

Armed with your agreed-upon approach with criteria, rules, and a process, you are now ready to do the work of classification. Like all projects, you will set up the classic project components.

Scope –

Define the universe of applications you will classify by listing them from your CMDB. Here you typically also define the out-of-scope systems, such as tools and platforms, which don’t directly maintain data, as well as application types that don’t maintain data, such as “pass-throughs.”

Will the scope be all of the applications? If you have hundreds of applications, you may want to prioritize them by criticality and put only the top-tier applications within the scope of the project. If you do this, your project will need to include a transition plan to convert the classification from a project into a BAU (business as usual) effort to complete the remaining applications.

 

Schedule/Timeline – 

Make your timeline realistic, or your project will end up in “Yellow” or “Red” status. How many applications total? Time allotted for one application in your written process? 

Keep in mind that, within your process – if you use the suggested outline in Part 2 – the item that will most extend your timeline will be scheduling the application teams. Their calendars will be full, and often you will get partial attendance, after which they’ll have to take the meeting contents back to their group for agreement. This will be especially true with the Business Owner of the application – the person who will have to approve the Record Series that you’ve decided upon with the rest of the team.

Start off with targets by quarter, then adjust as you see how many applications you can classify in a quarter. 

A pilot of a few willing application teams is recommended to test and adjust your process, and also to gauge the average time it will take to classify an application.

 

Resources –

You will need a project team. In addition to yourself, will you have a Business Analyst to help with the documentation? Will someone on your RIM team help with the engagement and progress tracking? Who will keep after the application teams to reply with their input and approvals?

If your organization has a Project Management Office, it would be better for you to get a Project Manager and to take the role of RIM Subject Matter Expert (SME) yourself. An experienced PM will know how to navigate the project management process at your organization and help ensure your project stays in “Green” status.

 

Communications –

A standard project communications plan should spell out who will do the reporting, in what format, at what frequency, and to which stakeholders.

 

What will your classification produce and where will the product reside?

At minimum, you will assign the Record Code of a Records Series from your organization’s Records Retention Schedule. For a repository for this code, ideally the group that manages your CMDB, possibly the IT Service Management (ITSM) group, would create a new field in the CMDB for Record Code and make it a Required Field for all new applications. That is classification from an RIM standpoint.

But while you have the application teams together, you could consider a more holistic product – one that will provide a compliance process to follow going forward. As the RIM expert, you could write a template for an application-level data retention procedure that each application team could fill in with the content for their particular application.

Such a template could include the following section headings:

  • Application Name
  • Record Code with Retention Period.
  • Identification of Business Data subject to Retention.
  • Identification of System Data that will be maintained per Operational Need.
  • Deletion Process
    • Deletion Roles and Responsibilities
    • Deletion Cadence
    • Identification of the Database Field that will serve as the Trigger for the Retention Period (e.g., transaction date field, decommission date field, loan payoff or sale date field, employee termination date field, and so on)
    • Deletion Report/Certificate of Destruction content
    • Repository for Deletion Report/Certificate of Destruction
  • Method for validating retention compliance
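A minimal sketch of the deletion step the template describes: select records past the trigger, delete them, and write a Certificate of Destruction entry. The application name, record code, and seven-year period are illustrative assumptions filled into the template’s headings.

```python
from datetime import date, timedelta

# Illustrative assumptions standing in for the template's content.
APPLICATION = "DepositSystem"
RECORD_CODE = "FIN-210"
RETENTION_DAYS = 7 * 365  # "transaction date plus 7 years", approximated

records = [
    {"id": 1, "transaction_date": date(2014, 5, 2)},
    {"id": 2, "transaction_date": date(2022, 8, 19)},
]

def deletion_run(records: list, today: date):
    """Delete records past retention and emit a Certificate of Destruction entry."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    expired = [r for r in records if r["transaction_date"] <= cutoff]
    retained = [r for r in records if r["transaction_date"] > cutoff]
    certificate = {
        "application": APPLICATION,
        "record_code": RECORD_CODE,
        "run_date": today.isoformat(),
        "records_deleted": len(expired),
        "deleted_ids": [r["id"] for r in expired],
    }
    return retained, certificate

retained, cert = deletion_run(records, date(2024, 1, 1))
print(cert["records_deleted"])  # -> 1
```

The certificate dictionary corresponds to the “Deletion Report/Certificate of Destruction content” heading; in practice it would be written to the designated repository named in the procedure.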

 

A convenient location for the completed procedure could be wherever the application architecture documents reside, which again, may be the CMDB.

However, while completion of a procedure for each application would provide a more complete information governance product, it would substantially increase the length and complexity of your classification project. So you must carefully consider the time and resources available to you before deciding to add the procedures.

 

Defining the Close Out of Your Project

How will you finish and close out your classification project? This will depend on the scope you choose. Will you endeavor to classify all the applications in your CMDB within the scope of the project? Or will you include only the most critical applications within the scope of your project, convert the project process to a BAU process upon completion of the designated critical applications, and then transition to the BAU process for project closeout?

Consider that new applications will be added to your CMDB going forward, so an on-going process will be needed regardless. That is an argument for choosing a small, defined subset of critical applications to be the scope of your project. Further, the responsibility for the BAU classifications could be transferred to the Application Owners, rather than the RIM/IG team. Your BAU role, then, would become limited to monitoring and reporting.

 

CONCLUSION

Classification of data in enterprise applications is a multi-discipline Information Governance effort. It will include collaboration with stakeholders from various related disciplines. As such, it will greatly benefit you as the project leader to have and employ a basic understanding of those other disciplines – in the planning and execution of the project of course, but also in the interaction with the other stakeholders as well. 

Leveraging your understanding of the basics of the complementary disciplines, you will then lead the classification project using RIM principles and practices. You are the expert and the leader there.

Finally, classic project management will be the vehicle you use to execute on the classification. Project management methodology will be a discipline that all of your stakeholders are familiar with, and in utilizing it you will further your project team members’ confidence in your ability as a leader to get the work done.


[i] Gartner, Information Technology Glossary. Definition of Architecture - IT Glossary | Gartner