15min solution to missing data problem
- Updated On 20 Jul 2020
- 7 Minutes To Read
Robotic Process Automation (RPA) users very commonly encounter problems with missing data points in the inputs of their automations. When data is partially unavailable, the automation is about as useful as a broken bridge: the automation rate stays low, because a human still needs to provide the missing inputs. This tutorial was written after we helped a customer predict several data points in purchase invoice processing: VAT category, factual checker, PO number, GL account and so on.
Before you begin
Getting an Aito instance
If you want to follow along with this tutorial, you'll have to go and get your own Aito instance from the Aito Console.
- Sign in or create an account, if you haven't got one already.
- In the Aito Console go to the instances page and click the "Create an instance" button.
- Select the instance type you want to create and fill in the needed fields. Sandbox is the free instance type for testing and small projects. Visit our pricing page to learn more about the Aito instance types.
- Click "Create instance" and wait for a moment while your instance is created, you will receive an email once your instance is ready.
If you're totally new to Aito, the get started guides might be useful before reading this tutorial.
Accessing instance details
Once your instance has been created, you can access its URL and API keys as follows:
- Log in to the Aito Console
- Click on the instance you created
- Go to the overview page. You can copy the API keys after clicking the eye icon.
Get the Aito Command Line Interface tool
- Get the Aito CLI tool if you don't have it yet (you'll need Python 3.6 or higher installed)
pip install aitoai
- Configure the CLI using the instance URL and read/write API key.
aito configure
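The curl examples later in this tutorial assume the instance name and the read/write API key are also available as environment variables. The variable names below are the ones used in the requests that follow; replace the values with your own.
export AITO_INSTANCE_NAME=your-instance-name
export AITO_API_KEY=your-read-write-api-key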
Dataset
We are using a simple dataset from Kaggle, where the goal is to predict a product category from the available invoice details. The dataset has been used as a hiring challenge for data scientists in India.
We have a training set (Train.csv) of 5566 rows of comma-separated values, including the target variable Product_Category. It looks like the table below; for improved readability, the description field is truncated.
Inv_Id | Vendor_Code | GL_Code | Inv_Amt | Item_Description | Product_Category
---|---|---|---|---|---
15001 | VENDOR-1676 | GL-6100410 | 83.24 | Artworking/Typesetting ... | CLASS-1963
15002 | VENDOR-1883 | GL-2182000 | 51.18 | Auto Leasing Corporate ... | CLASS-1250
15004 | VENDOR-1999 | 6050100 | 79.02 | Store Management Lease/Rent ... | CLASS-1274
... | ... | ... | ... | ... | ...
Each row represents invoice data, with a vendor code (Vendor_Code), a general ledger code (GL_Code), an invoice amount (Inv_Amt) and a textual description of what was invoiced (Item_Description).
The Kaggle dataset also comes with a test set (Test.csv), but it does not contain the ground truth. That's why we later evaluate Aito's performance by setting aside part of the training set for test purposes.
How to implement a missing value predictor with Aito
Aito is a predictive database, which runs machine learning inference directly within the database, based on queries resembling SQL. Aito is well suited for classification tasks with a dynamically changing prediction target (new Product_Category values are automatically incorporated into predictions).
Prerequisites
First, you will need an Aito instance to replicate the steps below yourself. Get a free Sandbox here.
Second, we are assuming you have installed the Aito CLI tool and set up the necessary environment variables.
Upload the dataset with CLI
Let's use the Aito CLI tool for an amazingly quick data upload. With a simple CSV file like ours, quick-add-table does exactly what is needed: it infers the Aito schema from the dataset, creates a table with that schema and uploads the data. All in one command.
Changing the filename to something more intuitive, like InvoiceData.csv, will help you later, as the filename is used to name the table in Aito.
mv Train.csv InvoiceData.csv
aito quick-add-table InvoiceData.csv
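To sanity-check the upload, you can query a few rows back from the new table with Aito's _query endpoint. A minimal sketch (adjust the limit as you like):
curl -X POST \
https://$AITO_INSTANCE_NAME.api.aito.ai/api/v1/_query \
-H "x-api-key: $AITO_API_KEY" \
-H "content-type: application/json" \
-d '
{
  "from": "InvoiceData",
  "limit": 3
}'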
Review schema
Next, check that the schema Aito inferred looks sensible.
(Sidenote: in this tutorial we are doing all the Aito queries from the command line so that they are quick to copy to your own terminal. You can accomplish the same results by sending requests with a REST client like Insomnia.)
curl -X GET \
https://$AITO_INSTANCE_NAME.api.aito.ai/api/v1/schema/InvoiceData \
-H "x-api-key: $AITO_API_KEY" \
-H "content-type: application/json"
The output is the schema (below), and by the looks of it, all is correct. Aito has decided that the Item_Description field should be analyzed as English text. Aito supports other analyzers, but let's not get sidetracked.
{
"columns": {
"GL_Code": {
"nullable": false,
"type": "String"
},
"Inv_Amt": {
"nullable": false,
"type": "Decimal"
},
"Inv_Id": {
"nullable": false,
"type": "Int"
},
"Item_Description": {
"analyzer": "english",
"nullable": false,
"type": "Text"
},
"Product_Category": {
"nullable": false,
"type": "String"
},
"Vendor_Code": {
"nullable": false,
"type": "String"
}
},
"type": "table"
}
Run a prediction
Let's make a quick test. Choose a random row from Test.csv and use it to predict the missing Product_Category. This is done with Aito's _predict API endpoint.
Let's look at the query that goes into the request body. The "knowns" are defined in the where clause, meaning the values we get from the chosen test data row. Then we set the prediction target to Product_Category, limiting the output to the top three predictions only.
curl -X POST \
https://$AITO_INSTANCE_NAME.api.aito.ai/api/v1/_predict \
-H "x-api-key: $AITO_API_KEY" \
-H "content-type: application/json" \
-d '
{
"from": "InvoiceData",
"where": {
"GL_Code": "GL-6101400",
"Inv_Amt": 55.93,
"Item_Description": "Arabian American Development Co Final Site Clean Up 2008-Oct General Requirements General Contractor Store Construction",
"Vendor_Code": "VENDOR-1254"
},
"predict": "Product_Category",
"limit": 3
}'
The response body is what we are looking for, and it looks very promising! Pay attention to the first element: $p shows that Aito is very sure (99.9%) that Product_Category is CLASS-1522, and nothing else. The following two predictions are extremely unlikely, with $p values close to zero.
{
"offset": 0,
"total": 36,
"hits": [
{
"$p": 0.9998137062471193,
"field": "Product_Category",
"feature": "CLASS-1522"
},
{
"$p": 7.345473788575557E-5,
"field": "Product_Category",
"feature": "CLASS-1828"
},
{
"$p": 3.118734024876525E-5,
"field": "Product_Category",
"feature": "CLASS-1983"
}
]
}
But wait! What if $p is not that high? Most of our customers choose a threshold (like 90% or higher) and use the prediction result in automation only if the probability exceeds the threshold. Anything lower, and the case is returned for human processing together with the suggestions.
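As a rough sketch of such a threshold check in the automation script (assuming jq is installed; the 0.9 threshold and the messages are only illustrative):
# RESPONSE holds the JSON body returned by the _predict call shown earlier
TOP_P=$(echo "$RESPONSE" | jq '.hits[0]["$p"]')
TOP_CLASS=$(echo "$RESPONSE" | jq -r '.hits[0].feature')
# accept the prediction only if it clears the chosen threshold
if (( $(echo "$TOP_P >= 0.9" | bc -l) )); then
  echo "Auto-fill Product_Category = $TOP_CLASS"
else
  echo "Probability $TOP_P below threshold, route to human review"
fi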
Evaluating accuracy
Aito has built-in functionality to evaluate the accuracy of the predictions. We discuss it in this How-to article, so follow the link for more info.
With the query below, Aito splits the data into two parts: a test set (every third entry) and a training set (the rest), then automatically runs test queries and returns the results. If you want to try this yourself, you'll need to familiarise yourself with Aito's Jobs, all explained in this article; a rough sketch of the job requests is shown after the query.
{
"test": {
"$index": { "$mod": [3, 0] }
},
"evaluate": {
"from": "InvoiceData",
"where": {
"GL_Code": {"$get": "GL_Code"},
"Inv_Amt": {"$get": "Inv_Amt"},
"Item_Description": {"$get": "Item_Description"},
"Vendor_Code": {"$get": "Vendor_Code"}
},
"predict": "Product_Category"
},
"select": ["trainSamples", "testSamples", "accuracy", "error"]
}
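As a sketch only (the exact endpoints and job lifecycle are explained in the Jobs article linked above), posting the evaluation as a job and fetching its result could look roughly like this, assuming the query above is saved in evaluate.json and that the jobs endpoints mirror the _evaluate path:
# create the evaluation job (endpoint assumed from the Jobs article)
curl -X POST \
https://$AITO_INSTANCE_NAME.api.aito.ai/api/v1/jobs/_evaluate \
-H "x-api-key: $AITO_API_KEY" \
-H "content-type: application/json" \
-d @evaluate.json
# the response contains a job id; poll the job status, then fetch the result
curl -H "x-api-key: $AITO_API_KEY" \
https://$AITO_INSTANCE_NAME.api.aito.ai/api/v1/jobs/<job-id>
curl -H "x-api-key: $AITO_API_KEY" \
https://$AITO_INSTANCE_NAME.api.aito.ai/api/v1/jobs/<job-id>/result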
Once Aito finishes running the job, you can fetch the result, which looks like this:
{
"trainSamples": 3711.0,
"testSamples": 1856,
"accuracy": 0.9983836206896551,
"error": 0.001616379310344862
}
With 1856 test samples, Aito's accuracy was 99.8%. Pretty awesome for 15 minutes of effort.
Operationalise
At this point, you actually have a fully functional prediction service running in the cloud (the requests we were sending to _predict earlier), and you could simply start sending in more queries just like the one above.
In a real-life scenario, you might have an initial training dataset as a starting point. Afterwards, new data points are added to Aito from the automation script. Aito will use all the new data without the need for a separate process to re-train and deploy models, learning as you go.
Here is how to add new data points one by one using Aito's database API.
curl -X POST \
https://$AITO_INSTANCE_NAME.api.aito.ai/api/v1/data/InvoiceData \
-H "content-type: application/json" \
-H "x-api-key: $AITO_API_KEY" \
-d '
{
"GL_Code": "GL-9900990",
"Inv_Amt": 39.00,
"Inv_Id": 34001,
"Item_Description": "Predictive database monthly subscription, developer tier",
"Product_Category": "CLASS-9999",
"Vendor_Code": "VENDOR-9999"
}'
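If your automation produces several rows at a time, entries can also be uploaded in batches. A minimal sketch, assuming the batch variant of the data endpoint (/data/InvoiceData/batch) that accepts an array of rows:
curl -X POST \
https://$AITO_INSTANCE_NAME.api.aito.ai/api/v1/data/InvoiceData/batch \
-H "content-type: application/json" \
-H "x-api-key: $AITO_API_KEY" \
-d '
[
  {
    "GL_Code": "GL-9900990",
    "Inv_Amt": 39.00,
    "Inv_Id": 34002,
    "Item_Description": "Predictive database monthly subscription, developer tier",
    "Product_Category": "CLASS-9999",
    "Vendor_Code": "VENDOR-9999"
  },
  {
    "GL_Code": "GL-9900990",
    "Inv_Amt": 78.00,
    "Inv_Id": 34003,
    "Item_Description": "Predictive database monthly subscription, team tier",
    "Product_Category": "CLASS-9999",
    "Vendor_Code": "VENDOR-9999"
  }
]'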
Summary
We created a production-grade version of a missing data prediction service in about 15 minutes with Aito. That is groundbreaking for RPA teams that have engineers on board but possibly get no TLC from data scientists.
Aito can be used to predict any of the missing fields in the dataset with the same simplicity. We did not touch on this in the prediction part, but you can essentially just change the knowns and predict another variable; there is no static trained model restricting the predictions. By having the data uploaded in Aito, you have already done the majority of the work. A sketch of such a query is shown below.
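For example, here is a sketch of predicting GL_Code instead, using the same _predict endpoint with Product_Category now among the knowns (the values are illustrative, adapted from the earlier example):
curl -X POST \
https://$AITO_INSTANCE_NAME.api.aito.ai/api/v1/_predict \
-H "x-api-key: $AITO_API_KEY" \
-H "content-type: application/json" \
-d '
{
  "from": "InvoiceData",
  "where": {
    "Vendor_Code": "VENDOR-1254",
    "Product_Category": "CLASS-1522",
    "Item_Description": "Final Site Clean Up General Requirements General Contractor Store Construction"
  },
  "predict": "GL_Code",
  "limit": 3
}'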