Resolving the COBOL Crisis with Artificial Intelligence

Dr. Alok Aggarwal

CEO and Chief Data Scientist

Scry Analytics, California, USA

Office: +1 408 872 1078; Mobile: +1 914 980 4717

June 20, 2020

Executive Summary

COBOL is a 61-year-old computer language for processing data. Although highly inefficient by modern standards, millions of COBOL programs remain pervasive in government and industry and are responsible for transactions worth three trillion dollars. Recreating them in contemporary computer languages is extremely time-consuming, laborious, and expensive. There is also an acute shortage of COBOL programmers, since universities no longer teach the language. This article illustrates the use of Artificial Intelligence to decode legacy COBOL programs, reducing the dependence on COBOL programmers by 85% and the cost of conversion by 75%.


COBOL (“COmmon Business-Oriented Language”) was designed in 1959 by CODASYL as an English-like, portable computer language for processing data [1]. In 1997, Gartner estimated that there were about 300 billion lines of computer code worldwide, with 80% (240 billion lines) of it in COBOL and 20% (60 billion lines) in all other computer languages combined [2,3]. Today, approximately 12 million COBOL programs with more than 200 billion total lines of code remain in use by organizations across the information technology, education, financial services, healthcare, and retail sectors, and these programs handle three trillion dollars in commerce, mainly through batch transaction processing.

With the growth of the Internet and Cloud Computing, many new companies – particularly in finance and retail sectors – now serve customers in real-time, instead of batch mode. For example, customers can place an online order through Amazon or Target in seconds, and merchants can receive their credit card monies through Square or Stripe almost instantaneously. This is well beyond the original functionality of COBOL programs, which typically run in batch mode one or two times a day, thereby leading to substantially longer delays in order fulfilment and payments than is acceptable by modern standards.

The inability of COBOL programs to scale up and quickly handle so many simultaneous requests has now become a critical problem. The urgency has been particularly pronounced during the COVID-19 pandemic, when outdated COBOL programs used by both federal and state governments have led to delays in disbursing funds and processing unemployment claims. Indeed:

  1. The US Internal Revenue Service scrambled to patch its COBOL-based Individual Master File in order to disburse around 150 million payments mandated by the Coronavirus Aid, Relief, and Economic Security (CARES) Act [4].
  2. With the ensuing unemployment surge in New Jersey, Governor Phil Murphy recently put out a call for volunteers who know how to code in COBOL, because many of New Jersey’s systems still run on old mainframes [5].
  3. Connecticut admitted that it too was struggling to process the large volume of unemployment claims with its 40-year-old COBOL mainframe system and is working to develop a new benefits system with the states of Maine, Rhode Island, Mississippi, and Oklahoma [7].

Impediments to replacing COBOL programs

The above issues highlight the need to replace COBOL programs with newer ones written in modern languages. However, understanding these COBOL programs is a major hurdle, for the following reasons:

  1. Spaghetti code: Unlike programs written in contemporary languages, COBOL programs often consist of intertwined pieces of code (“spaghetti code”). Since most COBOL programs have several thousand lines of code and deal with terabytes of data, updating them often produces inaccurate results or a complete breakdown.
  2. Verbosity: Although COBOL was meant to be easy for programmers to learn, use and maintain while still being readable to non-technical personnel such as managers, by 1984, many COBOL programs had become verbose and incomprehensible.
  3. Little documentation: Since COBOL was designed to make code self-documenting, programmers provided little or no documentation. Hence, government agencies and businesses still rely on “folklore” and long-retired COBOL programmers.
  4. Lots of COBOL variants: Although meant to be extremely portable, around 300 dialects and 104,976 official variants were created by 2001, rendering maintenance extremely difficult [8].

Unfortunately, COBOL experts who can decipher these programs are in short supply. Estimates suggest that only two million such programmers remain in the world, with about half of them retired [3]. These numbers continue to decrease, as colleges have long stopped teaching this language due to the existence of better ones. The few graduating students who know COBOL do not want to work in it for fear of being labelled “blue-collar tech workers” [9]. Speaking to the Tampa Bay Times, one COBOL programmer aptly summarized his experience of a transition from COBOL to Java: “It’s taken them four years, and they’re still not done” [10]. Reuters reported that when Commonwealth Bank of Australia replaced its core COBOL platform in 2012, the project took five years and cost $749.9 million. Finally, there are several solutions on the market for converting COBOL programs into other languages, but they too rely heavily on COBOL programmers, so the cost and time needed to replace COBOL programs with modernized ones remain immense.

Since the cost of replacing COBOL code is around 25 dollars per line [3, 11], replacing 200 billion lines of code will cost about five trillion dollars and take about 40 million person-years, of which approximately half (2.5 trillion dollars and 20 million person-years) will be spent deciphering COBOL programs. Fortunately, since the number of non-COBOL programmers is around 24 million and growing [12], if these black-box COBOL programs could somehow be decoded, then upgrading them to superior ones might take only a few years. However, since there are only about a million active COBOL programmers and their number is dwindling, the task of decoding the 12 million COBOL programs is likely to take at least twenty years, which will be the fundamental bottleneck going forward.
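The arithmetic behind these estimates can be reproduced in a few lines. The sketch below uses only the figures quoted above; the implied cost per person-year is derived here for illustration and does not come from the cited sources.

```python
# Back-of-the-envelope check of the estimates quoted in the text.
LINES_OF_COBOL = 200e9           # lines of COBOL still in use
COST_PER_LINE_USD = 25           # replacement cost per line [3, 11]
TOTAL_PERSON_YEARS = 40e6        # total effort stated in the text
DECIPHER_SHARE = 0.5             # about half is spent deciphering programs
ACTIVE_COBOL_PROGRAMMERS = 1e6   # roughly a million remain active

total_cost = LINES_OF_COBOL * COST_PER_LINE_USD            # 5 trillion dollars
decipher_cost = total_cost * DECIPHER_SHARE                # 2.5 trillion dollars
decipher_person_years = TOTAL_PERSON_YEARS * DECIPHER_SHARE  # 20 million

# Derived quantity (an inference from the two totals, not a sourced figure).
implied_cost_per_person_year = total_cost / TOTAL_PERSON_YEARS  # 125,000 dollars

# If every active COBOL programmer worked only on deciphering:
years_to_decipher = decipher_person_years / ACTIVE_COBOL_PROGRAMMERS  # 20 years

print(f"Total cost: ${total_cost:,.0f}")
print(f"Years to decipher with current workforce: {years_to_decipher:.0f}")
```

The twenty-year figure is a lower bound under these assumptions, since it treats the pool of active COBOL programmers as constant even though it is shrinking.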

Artificial Intelligence to the rescue

As discussed above, replacing COBOL programs with those containing enhanced features (e.g., real-time capability and handling surges of requests) involves the following two tasks:

  1. Understanding the COBOL programs and creating flow-charts describing how they work.
  2. Using these flow-charts to create new programs with improved features in contemporary languages.

Evidently, the lack of COBOL programming expertise and the huge expense it entails are the biggest hurdles in replacing legacy COBOL programs. Fortunately, the following two observations have enabled us to develop Artificial Intelligence (AI) based software that helps decipher COBOL programs, thereby reducing the conversion time and cost by 75% and the number of COBOL programmers required to just 15% of what would otherwise be needed:

  1. COBOL is a comparatively simple language with no pointers, no data structures, no user-defined functions or types, and no recursion, and with data types being only numbers or text.
  2. Most COBOL programs spend around 70% of their time executing input/output and read/write operations and their output tables provide a good synopsis of the entire execution.

Collatio®-Data Flow Mapping software from Scry Analytics ingests all input and output tables related to a given COBOL program and uses proprietary AI-based algorithms to reverse engineer the transformations performed by that program, thereby inferring the steps it executes and helping the user create a flow-chart of its inner workings.

Below, we explain how this software works via an example of a legacy COBOL program for approving or rejecting unemployment claims filed by 100,000 people during the ongoing COVID-19 crisis:

  1. One input table ingested by the legacy COBOL program: Typical forms for filing unemployment claims are 10 to 15 pages long and require the applicant to fill in 250 to 300 fields and values (e.g., names, addresses, social security numbers, past employment details). Hence, this table is likely to contain 100,000 rows (one for each applicant) and 250 to 300 columns.
  2. Other potential input tables ingested by the legacy COBOL program: These are likely to include “Single Source of Truth” tables that are used to verify the identity of applicants as well as various employers, etc. Other tables may also include conditions for providing unemployment wages and the formulas for computing the corresponding amounts. Alternatively, some of this information may be “hard-coded” in the legacy program itself.
  3. Execution of the legacy program: Once or a few times a day, this program will ingest all input tables and, using these tables together with the hard-coded information and formulas (including “If … Then … Else …” type formulas), it will be executed in “batch mode.”
  4. Output of the legacy program: To ensure data lineage, auditability, and traceability, the program is likely to write one or more results to the output table(s) as it goes through its instructions. For example, after checking the social security number and driver’s license, it may determine that the applicant made a typographical error in the last name; it will then use the actual name from one of the “Single Source of Truth” tables and write it to one of the output columns. In another output column, it may record the date-time stamp at which it accomplished this task. In another column, it may output the approval/rejection result, and in yet another, the computed unemployment wage that will be paid to this applicant. In summary, the output table contains a “fingerprint” of all the steps executed by this program, and it may also have around 100,000 rows and 200 to 300 columns. In fact, the number of rows in the output table may differ from that in the input table, especially if there are duplicate or blank rows.
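To make the four steps above concrete, the following is a minimal sketch of such a batch run, written in Python as a stand-in for the legacy COBOL program. Everything in it is hypothetical: the column names, the contents of the “Single Source of Truth” table, and the hard-coded thresholds and rates are invented for illustration and do not reflect any real unemployment system.

```python
from datetime import datetime, timezone

# Hypothetical "Single Source of Truth" table, keyed by social security number.
SSOT = {"123-45-6789": {"last_name": "Nguyen"}}

# Hypothetical hard-coded business rules (stand-ins for constants in legacy code).
MIN_WEEKS_WORKED = 20
BENEFIT_RATE = 0.5
MAX_WEEKLY_BENEFIT = 600.0

def process_claim(claim, now):
    """Turn one input row into one output row (the row's 'fingerprint')."""
    record = SSOT.get(claim["ssn"])
    # Prefer the verified last name, which corrects applicant typos.
    last_name = record["last_name"] if record else claim["last_name"]
    # "If ... Then ... Else ..." style rule for approval.
    approved = record is not None and claim["weeks_worked"] >= MIN_WEEKS_WORKED
    if approved:
        benefit = min(claim["weekly_wage"] * BENEFIT_RATE, MAX_WEEKLY_BENEFIT)
    else:
        benefit = 0.0
    return {
        "ssn": claim["ssn"],
        "verified_last_name": last_name,
        "name_checked_at": now.isoformat(),
        "decision": "APPROVED" if approved else "REJECTED",
        "weekly_benefit": benefit,
    }

def run_batch(claims, now=None):
    """Process all claims in one batch, skipping blank and duplicate rows."""
    now = now or datetime.now(timezone.utc)
    seen = set()
    output = []
    for claim in claims:
        if not claim.get("ssn") or claim["ssn"] in seen:
            continue  # the output may therefore have fewer rows than the input
        seen.add(claim["ssn"])
        output.append(process_claim(claim, now))
    return output
```

Each output column here is a function of input columns, reference tables, and hard-coded constants, which is what makes the output table usable as a fingerprint of the program's logic.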

Since most COBOL programs spend around 70% of their time on input/output and read/write operations and the remaining 30% calculating formulas and manipulating numbers and strings, Collatio®-Data Flow Mapping software uses the “fingerprint” given in the output tables to reverse engineer the transformations. It relies primarily on the cell values in the input and output tables. More precisely, it determines which columns of the input tables, and which potential formulas and constants hard-coded in the legacy program, are being used to produce each column of the output table. During the entire process, it seldom uses column names, and since most COBOL programs have little or no documentation, it almost never uses the corresponding ontology (i.e., text-based relationships among various columns).
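The general idea of inferring transformations from cell values alone can be illustrated with a deliberately naive sketch. This is not the proprietary Collatio® algorithm: it is a brute-force search over a few simple candidate families (copy a column, add or multiply two columns, scale a column by a constant), and it assumes that input and output rows are already aligned one-to-one. All column names and values are invented for illustration.

```python
import math
from itertools import combinations

def is_num(v):
    """True for ints and floats, but not booleans."""
    return isinstance(v, (int, float)) and not isinstance(v, bool)

def match_rate(predicted, actual):
    """Fraction of rows where a candidate reproduces the observed output."""
    hits = 0
    for p, a in zip(predicted, actual):
        if is_num(p) and is_num(a):
            hits += math.isclose(p, a, rel_tol=1e-9, abs_tol=1e-9)
        else:
            hits += (p == a)
    return hits / len(actual)

def candidates(inputs, out_col):
    """Yield (description, predicted column) for a few simple families."""
    numeric = {n: c for n, c in inputs.items() if all(is_num(v) for v in c)}
    for name, col in inputs.items():
        yield f"copy({name})", col
    for (a, col_a), (b, col_b) in combinations(numeric.items(), 2):
        yield f"{a} + {b}", [x + y for x, y in zip(col_a, col_b)]
        yield f"{a} * {b}", [x * y for x, y in zip(col_a, col_b)]
    for name, col in numeric.items():
        # Guess a hard-coded constant k from the first usable row: out = k * in.
        k = next((o / x for x, o in zip(col, out_col) if is_num(o) and x != 0), None)
        if k is not None:
            yield f"{k:g} * {name}", [k * x for x in col]

def infer_transformations(inputs, outputs):
    """For each output column, return the best candidate and its match rate."""
    result = {}
    for out_name, out_col in outputs.items():
        scored = [(match_rate(pred, out_col), desc)
                  for desc, pred in candidates(inputs, out_col)]
        rate, desc = max(scored, key=lambda s: s[0])
        result[out_name] = (desc, rate)
    return result

# Toy tables; column names are used only as labels, never to guide the search.
inputs = {
    "last_name": ["Ngyuen", "Smith", "Lee", "Patel"],
    "base_pay": [800.0, 400.0, 1000.0, 600.0],
    "bonus": [100.0, 0.0, 50.0, 20.0],
}
outputs = {
    "name_out": ["Nguyen", "Smith", "Lee", "Patel"],  # one name was corrected
    "gross": [900.0, 400.0, 1050.0, 620.0],
    "benefit": [400.0, 200.0, 500.0, 300.0],
}
result = infer_transformations(inputs, outputs)
# result == {"name_out": ("copy(last_name)", 0.75),
#            "gross": ("base_pay + bonus", 1.0),
#            "benefit": ("0.5 * base_pay", 1.0)}
```

Even in this toy setting, the recovered constant 0.5 corresponds to a value that would be hard-coded in the legacy program, and the 0.75 match rate on the name column flags rows where some additional rule (here, a name correction) must be at work. A realistic search space is vastly larger, which is why exhaustive enumeration of this kind does not scale to tables with hundreds of columns.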

Key features of this decision support software are given below:

  1. The proprietary algorithms in this software are probabilistic and use advanced techniques from Math and Computer Science, and specifically, from Artificial Intelligence (Machine Learning and Natural Language Processing) and Operations Research.
  2. Since the underlying algorithms are AI-based, the software reports each transformation together with a confidence level (the probability that the proposed transformation is correct) and an accuracy measure (the percentage of records in the output table that are consistent with the identified transformation function).
  3. Since it is decision support software, it comes with a preconfigured graphical user interface (GUI) that depicts the transformations among various columns and helps the user in experimenting and determining the steps being executed by the COBOL program. It also provides an API for downloading these transformations to a spreadsheet or in JSON format.
  4. Unsurprisingly, this software is highly compute- and memory-intensive. Hence, it has been optimized with respect to parallel and distributed processing. The number of processing cores and the amount of random-access memory (RAM) required depend upon the size of the input and output tables, which can easily contain a thousand columns and several million rows.
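As an illustration of the confidence and accuracy measures and the JSON export described in the list above, the sketch below shows one way such results might be represented. The record layout and field names are hypothetical and are not the actual Collatio® API or export schema; the confidence value is simply supplied by the caller, since computing it would require the underlying probabilistic model.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class InferredTransformation:
    """Hypothetical record for one inferred transformation."""
    output_column: str
    expression: str
    confidence: float    # probability the expression is correct (supplied, not computed here)
    accuracy_pct: float  # percentage of output records consistent with the expression

def accuracy_pct(transform, input_rows, output_values):
    """Percentage of records whose output value the transformation reproduces.

    Uses exact equality for simplicity; real data would need a tolerance.
    """
    hits = sum(1 for row, out in zip(input_rows, output_values)
               if transform(row) == out)
    return 100.0 * hits / len(output_values)

def to_json(transformations):
    """Serialize inferred transformations as a JSON array."""
    return json.dumps([asdict(t) for t in transformations], indent=2)

# Toy example: three of four records follow "0.5 * weekly_wage".
input_rows = [{"weekly_wage": w} for w in (900.0, 400.0, 1000.0, 600.0)]
output_values = [450.0, 200.0, 500.0, 0.0]

acc = accuracy_pct(lambda r: r["weekly_wage"] * 0.5, input_rows, output_values)
report = to_json([
    InferredTransformation("weekly_benefit", "0.5 * weekly_wage",
                           confidence=0.9, accuracy_pct=acc),
])
```

An accuracy below 100%, as in this toy example, is a cue for the user to look for a second rule, such as a rejection condition, that governs the remaining records.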