How to create a schema
  • Updated on 21 Jul 2020
  • 9 minutes to read
  • Print
  • Share
  • Dark
    Light

How to create a schema

  • Print
  • Share
  • Dark
    Light

Introduction to schemas

Aito database consists of one or more tables. The tables and their internal structure is defined in the database schema.

Aito database schemas are defined via JSON objects with the following structure.

{
  "schema": {
    "table-a": {
      "type":"table",
      "columns": {
        "column-1" : { "type": "Int" },
        "column-2" : { "type": "String" }     
      },
    }, {
      "table-b": {
        "type":"table",
        "columns": {
          ...
        }
      }
    }
  },
  ...
}

As visible in the structure, each schema can have multiple tables and each table can have multiple columns of different types. Each column can have the following properties defined:

  • type, which can be either int, decimal, boolean, string or text
  • nullable, which can be either true or false. If nullable is true, the value can be omitted or set to null.
  • link, which defines a symbolic reference to another row in this or another table.
  • analyzer, which can be only defined for text and which tells how the text is split into features/terms.
  • value, which describes the default value, when the value in the data row is omitted or null.

Making our first schema

Let's consider the schema through the following example. In the example, an imaginary e-commerce grocery store wants to use Aito to automatically categorize products and tag products. In order to this, they need a schema for the product catalog. The grocery store products have the following properties:

  • Identifier, which is an integer
  • Name, which is English text
  • Category, which is a keyword
  • Tags, which is a list of keywords
  • Available, which is a boolean, and it is used to mark objects, that can be bought
  • Price, which is a decimal number

To meet these requirements and based on the existing data, the following schema is designed:

{
  "schema"  :  {
    "products": {
      "type":"table",
      "columns": {
        "id":        { "type": "Int" },
        "name":      { "type": "Text",  "analyzer": "English" },
        "category":  { "type": "String", "nullable" : true },
        "tags":      { "type": "Text", "analyzer" : "Whitespace" },
        "available": { "type": "Boolean", "value" : false },
        "price":     { "type": "Decimal", "nullable": true }
       }
     }
   }
 }

It's worth noting, that the category field, which contains a keyword value, is typed as string. And the name field, which contains common English text, is typed as text with an english analyzer. The text analyzers will split the text into terms and normalize the terms. For example, the English analyzer may parse 'Fazer rye breads' into terms 'Fazer', 'rye' and 'bread'. This operation lets Aito inference later infer, that the term 'bread' is a strong predictor for the 'Bread' category if there is a datapoint of 'Vaasa soft bread' term belonging to the same 'bread' category.

Text vs string

If text is not treated as an English text, but as a keyword string, Aito will not recognize the shared 'bread' feature between 'Fazer rye breads' and 'Vaasa soft bread' texts and the inference will fail. Also e.g. a whitespace analyzer may lead Aito to treat 'breads' and 'bread' as separate things, and fail to find share properties among 'Fazer rye breads' and 'Vaasa soft bread'.

It's also worth noting that the category and price fields are set as nullable. This allows creating temporary products, that don't yet have category or price defined. id is not set as nullable, because an identifier is required to make the row referable. Name and tags are not nullable, as they can easily contain empty strings or 'not named' place holder. available field is not nullable, but it is set to false by default so it doesn't need to specified in the feed json.

It's also worth noting, that because Aito does not currently support lists, the 'tags' field keyword list is transformed into a whitespace-separated string.

Numeric values

By default, Aito treats all values as discrete. This means, that Aito won't make assumption that the values 3, 3.1 and 2.9 might have similar statistical behavior, but instead it is assumed that they behave like identifiers or keywords. To make Aito use the statistics of the nearby numeric values: use the $numeric proposition in the queries.

The database schema can be initialized by sending a PUT request to the $INSTANCE_URL/api/v1/schema endpoint.

To verify that the schema has been updated: you can send a GET request to the $INSTANCE_URL/api/v1/schema endpoint. Once the data has been uploaded, the table can be used in prediction queries like:

{
  "from": "products", 
  "where": {
    "name" : "Fazer rye bread"
  },
  "predict" : "category"
}

Making a schema with links for recommendations

Now let's assume, that the categorization project was a success, and the grocery store is looking for ways to help their customers' everyday life by implementing click-through rate (CTR) optimizing search functionality. To optimize a product impression CTR one needs information about a) the product, b) the click and c) contextual information about the usage. To keep better track of the impressions, the entries have also an identifier and a timestamp

  • id
  • timestamp
  • user
  • product identifier
  • click

In order to support searches, it must be possible to access the product information from the impression table. Because there can be e.g. 10 impressions per context. This means that the product identifier must be linked to the product table. This leads to the following schema:

{
  "schema"  :  {
    "products": {
      "type":"table",
      "columns": {
        "id":        { "type": "Int" },
        "name":      { "type": "Text",  "analyzer": "English" },
        "category":  { "type": "String", "nullable" : true },
        "tags":      { "type": "Text", "analyzer" : "Whitespace" },
        "available": { "type": "Boolean", "value" : false },
        "price":     { "type": "Decimal", "nullable": true }
       }
     }, 
     "impressions" : {
       "type" : "table", 
       "columns": {
         "id":       { "type" : "Int" }, 
         "timestamp":{ "type" : "Int" }, 
         "user":     { "type" : "Int" }, 
         "product":  { "type" : "Int", "link": "products.id" },
         "click":    { "type" : "Boolean" }
       }
     }
   }
 }

Aito doesn't yet support datetimes, so the user can either choose to represent times either as integer timestamps or as strings with formats like "20201010" or "20201010_080410".

Links

The link definitions are always of the format table.colum as visible int he product column. Both link type and the linked identifier (products.id) type must be the same for the linking to succeed. Aito is somewhat lenient with linking and the symbolic links without a target won't cause an error, but they will neither resolve. This means that the values behind the links will remain in the undefined/null state.

Once the new schema is PUT into the $INSTANCE_URL/api/v1/schema endpoint and once it's populated with data recommend queries can be used to provide recommendations or personalized search:

{
  "from": "impressions", 
  "where": {
    "user" : 5445,
    "product.title" : { "$match" : "milk" }
  },
  "recommend" : "product", 
  "goal": { "click" : true }
}

Making a schema for sequential data

As the last example. Let's consider a situation, where the grocery store is interested about the way how user's previous action affects his decisions. Often, the customer decisions don't happen in isolation, but instead they are a part of a longer sequence. In the grocery store setting: for example the customer queries and purchases are often time-wise related. After a person has bought Tortillas, he/she might be looking for Taco Sauce or Guacamole.

Let's next introduce another table called 'contexts', which describes a situation in the application, where the user is seeing impressions from recommendations, search results or catalog content. The table serves 2 purposes a) it simplifies the impression table by providing the impression metadata and b) it makes it possible to model the temporal sequences through self-links:

{
  "schema"  :  {
    "products": {
      "type":"table",
      "columns": {
        "id":        { "type": "Int" },
        "name":      { "type": "Text",  "analyzer": "English" },
        "category":  { "type": "String", "nullable" : true },
        "tags":      { "type": "Text", "analyzer" : "Whitespace" },
        "available": { "type": "Boolean", "value" : false },
        "price":     { "type": "Decimal", "nullable": true }
      }
    }, 
    "contexts" : {
      "type" : "table", 
      "columns" : {
        "id":             { "type" : "Int" }, 
        "timestamp":      { "type" : "Int" }, 
        "user":           { "type" : "String" },
        "prev":           { "type" : "Int", "link": "context.id", "nullable" : true }, 
        "prev_prev":      { "type" : "Int", "link": "context.id", "nullable" : true  }, 
        "prev_clickedProduct":      { "type" : "Int", "link": "products.id", "nullable" : true  }, 
        "query":          { "type" : "Text", "analyzer" : "English" },
        "queryPhrase":    { "type" : "String" },
        "clickedProduct": { "type" : "Int", "link" : "products.id, "nullable" : true " }
      }
    }, 
    "impressions" : {
      "type" : "table", 
      "columns": {
        "context":  { "type" : "Int", "link": "context.id" }, 
        "product":  { "type" : "Int", "link": "products.id" },
        "click":    { "type" : "Boolean" }
      }
    }
  }
}

What's worth noting that the impressions contains now a link to the context table and that the impression table is now smaller, as the id, timestamp and user identifier has been moved into the 'contexts' table.

Another important thing to notice is that the context table has now a link to the previous context. This makes the following kind of queries possible:

{
  "from" : "context", 
  "where" : {
      "prev.clickedProduct" : 4234234 // a tortilla product 
   },
   "predict" : "clickedProduct"
}

In the query, we are trying predict that what product the customer may click next, if the previous product was a tortilla (assuming that 4234234 is a tortilla product's ID).

Yet, the linked table contains also fields like prev_prev and prev_clickedProduct. Bluntly put, these fields exist to compensate Aito's current technical limitations, that relate to limited link depth. Mainly, one cannot do statistical reasoning with or even refer to prev.prev.query or context.prev.clickedProduct. For this reason, to make it possible to find the relationship between query two contexts a go and the query now: a flattened link prev_previs needed to relate prev_prev.querywith query. Similarly, prev_clickedProduct is needed for the following CTR-based recommendations:

{
  "from" : "impressions", 
  "where" : {
      "context.prev_clickedProduct" : 4234234 // a tortilla product 
   },
   "recommend" : "product", 
   "goal" : { "click" : true }
}
Link restrictions

Aito does additional preparations for each database link during write time. These preparations are needed to enable fast table-to-table statistical reasoning. This has 2 implications. First of all: the links do slow down the write operations by several tens of percents. Second: Aito does not yet do preparations through consecutive links, which means that in queries you can only refer to values behind one link. This is referred to as Aito having support only link depth of 1. We have ready plans to lift this restriction in the future.

As the last, but not least, observation: the context table contains separate query and queryPhrasefields. This is related to another Aito restriction. Aito has indexes and data structures only for Text features, while it doesn't have indexes for the text values themselves. This causes a behavior, where Aito treats every single text field value as distinct, even when they are not, and it can return duplicates within the results.

Now in order to predict user queries based on the previous query: the queryPhrase field is needed as in the following query:

{
  "from" : "contexts", 
  "where" : {
      "prev.query" : "spicy tortilla",
       "queryPhrase" : { "$startsWith" : "g" }
   },
   "get" : "queryPhrase", 
   "orderBy" : "$p"
}

The query may reveal that after users search for 'spicy tortilla' and types 'g', the next query will be about 'guacamole'. Such predictions can be used to provide query smart suggestions or analytics about user behavior.

Was this article helpful?