15min solution to missing data problem
20 Jul 2020


Robotic Process Automation (RPA) users very commonly run into missing data points in the inputs of their automations. When data is partially unavailable, the automation is about as useful as a broken bridge: the automation rate stays low, because a human still needs to provide the missing inputs. This tutorial grew out of work helping a customer predict several data points in purchase invoice processing: VAT category, factual checker, PO number, GL account and so on.

Before you begin

Getting an Aito instance

If you want to follow along with this tutorial, you'll have to go and get your own Aito instance from the Aito Console.

  1. Sign in, or create an account if you haven't got one already.
  2. In the Aito Console, go to the instances page and click the "Create an instance" button.
  3. Select the instance type you want to create and fill in the needed fields. Sandbox is the free instance type for testing and small projects; visit our pricing page to learn more about the Aito instance types.
  4. Click "Create instance" and wait for a moment while your instance is created. You will receive an email once it is ready.

If you're totally new to Aito the get started guides might be useful before reading this tutorial.

Accessing instance details

Once your instance is ready, you can access its URL and API keys as follows:

  1. Log in to the Aito Console.
  2. Click on the instance you created.
  3. Go to the overview page. You can copy the API keys after clicking the eye icon.


Get the Aito Command Line Interface tool

  1. Get the Aito CLI tool if you don't have it yet (you'll need Python 3.6 or higher installed):
pip install aitoai
  2. Configure the CLI using the instance URL and the read/write API key:
aito configure
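
The curl examples later in this tutorial read the instance name and the API key from environment variables, so it is handy to export those as well. A quick sketch (fill in the values from the instance overview page):

export AITO_INSTANCE_NAME=your-instance-name
export AITO_API_KEY=your-read-write-api-key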

Dataset

We are using a simple dataset from Kaggle, where the target is to predict a product category from the available invoice details. The dataset has been used as a hiring challenge for data scientists in India.

We have a training set (Train.csv) of 5566 rows of comma-separated values, including the target variable Product_Category. It looks like the table below; for improved readability, the description field is truncated.

Inv_Id Vendor_Code GL_Code Inv_Amt Item_Description Product_Category
15001 VENDOR-1676 GL-6100410 83.24 Artworking/Typesetting ... CLASS-1963
15002 VENDOR-1883 GL-2182000 51.18 Auto Leasing Corporate ... CLASS-1250
15004 VENDOR-1999 GL-6050100 79.02 Store Management Lease/Rent ... CLASS-1274
... ... ... ... ... ...

Each row represents invoice data, with vendor code Vendor_Code, general ledger code GL_Code, invoice amount Inv_Amt and a textual description of what was invoiced Item_Description.

The Kaggle dataset also comes with a test set (Test.csv), but it does not contain the ground truth. That's why we later evaluate Aito's performance by setting aside part of the training set for testing.

How to implement a missing value predictor with Aito

Aito is a predictive database that runs machine learning inference directly within the database, using queries that resemble SQL. Aito is well suited for classification tasks with a dynamically changing prediction target: new Product_Category values are automatically incorporated into the predictions.

Prerequisites

First, you will need an Aito instance to replicate the steps below yourself. Get a free Sandbox here.

Second, we assume you have installed the Aito CLI tool and set up the necessary environment variables, as described above.

Upload the dataset with CLI

Let's use the Aito CLI tool for an amazingly quick data upload. For a simple CSV file like ours, quick-add-table does exactly what is needed: it infers the Aito schema from the dataset, creates a table with that schema and uploads the data, all in one command.

Changing the filename to something more intuitive, like InvoiceData.csv, will help you later, as the filename is used to name the table in Aito.

mv Train.csv InvoiceData.csv
aito quick-add-table InvoiceData.csv

Review schema

Next, check that the schema Aito inferred looks sensible.

(Sidenote: in this tutorial we run all the Aito queries from the command line, so that they are quick to copy into your own terminal. You can accomplish the same results by sending requests with a REST client like Insomnia.)

curl -X GET \
https://$AITO_INSTANCE_NAME.api.aito.ai/api/v1/schema/InvoiceData \
-H "x-api-key: $AITO_API_KEY" \
-H "content-type: application/json"

The output is the schema (below), and by the looks of it, everything is correct. Aito has decided that the Item_Description field should be analyzed as English text. Aito supports other analyzers too, but let's not get sidetracked.

{
  "columns": {
    "GL_Code": {
      "nullable": false,
      "type": "String"
    },
    "Inv_Amt": {
      "nullable": false,
      "type": "Decimal"
    },
    "Inv_Id": {
      "nullable": false,
      "type": "Int"
    },
    "Item_Description": {
      "analyzer": "english",
      "nullable": false,
      "type": "Text"
    },
    "Product_Category": {
      "nullable": false,
      "type": "String"
    },
    "Vendor_Code": {
      "nullable": false,
      "type": "String"
    }
  },
  "type": "table"
}

Run a prediction

Let's make a quick test: choose a random row from Test.csv and use it to predict the missing Product_Category. This is done with Aito's _predict API endpoint.

Let's look at the query that goes into the request body. The "knowns" are defined in a where clause; these are the values we get from the chosen test data row. Then we set the prediction target to Product_Category, limiting the output to the top three predictions.

curl -X POST \
https://$AITO_INSTANCE_NAME.api.aito.ai/api/v1/_predict \
-H "x-api-key: $AITO_API_KEY" \
-H "content-type: application/json" \
-d '
{
  "from": "InvoiceData",
  "where": {
    "GL_Code": "GL-6101400",
    "Inv_Amt": 55.93,
    "Item_Description": "Arabian American Development Co Final Site Clean Up 2008-Oct  General Requirements General Contractor Store Construction",
    "Vendor_Code": "VENDOR-1254"
  },
  "predict": "Product_Category",
  "limit": 3
}'

The response body is what we are looking for, and it looks very promising. Pay attention to the first element: $p shows that Aito is very sure (99.98%) that Product_Category is CLASS-1522. The two remaining predictions are extremely unlikely, with $p values close to zero.

{
  "offset": 0,
  "total": 36,
  "hits": [
    {
      "$p": 0.9998137062471193,
      "field": "Product_Category",
      "feature": "CLASS-1522"
    },
    {
      "$p": 7.345473788575557E-5,
      "field": "Product_Category",
      "feature": "CLASS-1828"
    },
    {
      "$p": 3.118734024876525E-5,
      "field": "Product_Category",
      "feature": "CLASS-1983"
    }
  ]
}

But wait! What if the $p is not that high? Most of our customers choose a threshold (like 90% or higher) and use the prediction result in the automation only if the probability exceeds the threshold. Anything lower, and the case is returned for human processing together with the suggestions.
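
As a minimal sketch of that pattern, the snippet below reads the top prediction from a saved _predict response like the one above and auto-fills only when $p exceeds 0.9. The file name, the 0.9 threshold, and the use of jq and bc are our own choices for illustration:

# Assumes the _predict response shown above was saved to response.json
P=$(jq '.hits[0]["$p"]' response.json)
CATEGORY=$(jq -r '.hits[0].feature' response.json)

if [ "$(echo "$P > 0.9" | bc -l)" -eq 1 ]; then
  # Confident enough: fill in the missing value automatically
  echo "Auto-fill Product_Category with $CATEGORY (p = $P)"
else
  # Below the threshold: hand the case back to a human, with the suggestion attached
  echo "Confidence too low (p = $P), route to a human with suggestion $CATEGORY"
fi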

Evaluating accuracy

Aito has built-in functionality to evaluate the accuracy of its predictions. We discuss it in this How-to article, so follow the link for more info.

With the query below, Aito splits the data into two parts, a test set (every third entry) and a training set (the rest), then automatically runs the test queries and returns the results. If you want to try this yourself, you'll need to familiarise yourself with Aito's Jobs, all explained in this article; a sketch follows after the query.

{
  "test": {
    "$index": { "$mod": [3, 0] }
  },
  "evaluate": {
    "from": "InvoiceData",
    "where": {
      "GL_Code": {"$get": "GL_Code"},
      "Inv_Amt": {"$get": "Inv_Amt"},
      "Item_Description": {"$get": "Item_Description"},
      "Vendor_Code": {"$get": "Vendor_Code"}
    },
    "predict": "Product_Category"
  },
  "select": ["trainSamples", "testSamples", "accuracy", "error"]
}
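
As a rough sketch, and assuming the jobs endpoints work as described in the linked article, you could post the evaluation and later fetch its result along these lines (evaluate.json holds the query above; the job id comes back in the first response):

# Post the evaluation query as a long-running job (assumed endpoint: jobs/_evaluate)
curl -X POST \
  https://$AITO_INSTANCE_NAME.api.aito.ai/api/v1/jobs/_evaluate \
  -H "x-api-key: $AITO_API_KEY" \
  -H "content-type: application/json" \
  -d @evaluate.json

# Fetch the result once the job has finished, using the id from the response above
curl -X GET \
  https://$AITO_INSTANCE_NAME.api.aito.ai/api/v1/jobs/<job-id>/result \
  -H "x-api-key: $AITO_API_KEY"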

Once the above query is posted as a job and Aito finishes running it, you can fetch the result, which looks like this:

{
  "trainSamples": 3711.0,
  "testSamples": 1856,
  "accuracy": 0.9983836206896551,
  "error": 0.001616379310344862
}

With 1856 test samples, Aito's accuracy was 99.8%. Pretty awesome for 15 minutes of effort.

Operationalise

At this point you actually have a fully functional prediction service running in the cloud (the requests we were sending to _predict earlier), and you can simply start sending in more queries like the one above.

In a real-life scenario, you might have an initial training dataset as a starting point. Afterwards, new data points are added to Aito from the automation script. Aito uses all the new data without a separate process to re-train and deploy models, so it learns as you go.

Here is how to add new data points one by one using Aito's database API.

curl -X POST \
  https://$AITO_INSTANCE_NAME.api.aito.ai/api/v1/data/InvoiceData \
  -H "content-type: application/json" \
  -H "x-api-key: $AITO_API_KEY" \
  -d '
  {
    "GL_Code": "GL-9900990",
    "Inv_Amt": 39.00,
    "Inv_Id": 34001,
    "Item_Description": "Predictive database monthly subscription, developer tier",
    "Product_Category": "CLASS-9999",
    "Vendor_Code": "VENDOR-9999"
  }'
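
If your automation accumulates rows in bulk, Aito's database API also has a batch insert that takes an array of entries. We believe the endpoint is /api/v1/data/{table}/batch as per Aito's API reference, but double-check there before relying on this sketch (shown here with a batch of one, using made-up values like the example above):

curl -X POST \
  https://$AITO_INSTANCE_NAME.api.aito.ai/api/v1/data/InvoiceData/batch \
  -H "content-type: application/json" \
  -H "x-api-key: $AITO_API_KEY" \
  -d '
  [
    {
      "GL_Code": "GL-9900990",
      "Inv_Amt": 39.00,
      "Inv_Id": 34002,
      "Item_Description": "Predictive database monthly subscription, developer tier",
      "Product_Category": "CLASS-9999",
      "Vendor_Code": "VENDOR-9999"
    }
  ]'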

Summary

We created a production-grade missing data prediction service in about 15 minutes with Aito. Pretty groundbreaking for RPA teams that have engineers but get little or no TLC from data scientists.

Aito can be used to predict any of the missing fields in the dataset with the same simplicity. We did not touch on this in the prediction part, but you can essentially just change the knowns and predict another variable, as sketched below. There is no static trained model that restricts the predictions; by having the data uploaded in Aito you have already done the majority of the work.
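
For instance, to fill in a missing GL_Code instead, swap the roles of the fields in the earlier _predict query (a sketch using values from the same example row, with Product_Category now among the knowns):

curl -X POST \
  https://$AITO_INSTANCE_NAME.api.aito.ai/api/v1/_predict \
  -H "x-api-key: $AITO_API_KEY" \
  -H "content-type: application/json" \
  -d '
{
  "from": "InvoiceData",
  "where": {
    "Product_Category": "CLASS-1522",
    "Inv_Amt": 55.93,
    "Item_Description": "Arabian American Development Co Final Site Clean Up 2008-Oct  General Requirements General Contractor Store Construction",
    "Vendor_Code": "VENDOR-1254"
  },
  "predict": "GL_Code",
  "limit": 3
}'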
