Data model #

This chapter describes how the Proxima platform maps its data model, expressed in terms of entities and attributes, onto its core abstraction: a commit log (a stream).

Entities #

The platform defines the complete data model in terms of abstract entities. Each entity has a key and a set of attributes, each of which has a type and an associated serialization.

A key is a unique string identifier of a particular instance of an entity. Attributes are the set of properties that the particular entity can have. We can picture this as a (sparse) table:

| Entity key | AttributeX | AttributeY |
|------------|------------|------------|
| first      | 1          | 2          |
| second     | 2          | 3          |

Here we see two instances of a particular entity with attributes AttributeX and AttributeY, both of type int.
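One way to picture this sparse table in code is a mapping from entity key to a mapping of attribute names to values. The Python sketch below is purely illustrative: the `table` variable and the `get_attribute` helper are hypothetical names and not part of the platform.

```python
from typing import Dict, Optional

# Sparse table: entity key -> {attribute name -> value}.
# All names here are illustrative only, not part of the platform API.
table: Dict[str, Dict[str, int]] = {
    "first": {"AttributeX": 1, "AttributeY": 2},
    "second": {"AttributeX": 2, "AttributeY": 3},
}


def get_attribute(key: str, attribute: str) -> Optional[int]:
    """Return the attribute value, or None when the cell is empty."""
    return table.get(key, {}).get(attribute)
```

A missing key or a missing attribute both yield `None`, which is what makes the table sparse: an entity instance stores only the attributes it actually has.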

The type and serialization of each attribute are defined by a scheme and an associated ValueSerializer. Let’s assume we have an e-shop which sells some goods. A typical e-shop will have a database of products and users and will want to track the behavior of users on the website in order to provide some level of personalization. Given such a use case, we would define entities in HOCON configuration as follows:

entities {
  # user entity, let's make this really simple
  user {
    attributes {

      # some details of user - e.g. name, email, ...
      details { scheme: "proto:cz.o2.proxima.example.Example.UserDetails" }

      # model of preferences based on events
      preferences { scheme: "proto:cz.o2.proxima.example.Example.UserPreferences" }

      # selected events are stored to user's history
      "event.*" { scheme: "proto:cz.o2.proxima.example.Example.BaseEvent" }

    }
  }
  # entity describing a single good we want to sell
  product {
    # note: every piece of data that we want to be able to update *independently*
    # has to be defined as a separate attribute
    attributes {

      # price, with some possible additional information, like VAT and other stuff
      price { scheme: "proto:cz.o2.proxima.example.Example.Price" }

      # some general details of the product
      details { scheme: "proto:cz.o2.proxima.example.Example.ProductDetails" }

      # list of associated categories
      "category.*" { scheme: "proto:cz.o2.proxima.example.Example.ProductCategory" }

    }
  }

  # the events which link users to goods
  event {
    attributes {

      # the event is atomic entity with just a single attribute
      data { scheme: "proto:cz.o2.proxima.example.Example.BaseEvent" }

    }
  }
}

Such a configuration defines three entities: user, product and event. Entity user has three attributes: details, preferences and event.*. Let’s see what this means in the following section.

Attributes #

Each attribute is one of two possible types:

  • scalar attribute
  • wildcard attribute

Scalar attributes #

Attributes details and preferences are examples of scalar attributes of entity user. Such an attribute may be present or missing (null), but if present, it can have only a single value of the given type. The type of the attribute is given by its scheme. A scheme is a URI which points to a ValueSerializerFactory, which creates instances of ValueSerializer. This serializer is then used whenever the platform needs to convert the object representing the attribute’s value to bytes and back.

The scheme proto: used in the example above declares that the attribute will be serialized using ProtoValueSerializer and will hold an instance of the corresponding class generated by protoc (which extends Message). For details, refer to the Protocol Buffers documentation.
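The lookup from a scheme URI to a serializer can be sketched as follows. This is a conceptual Python model under stated assumptions, not the platform's API: the `FACTORIES` registry, the `serializer_for` function and the `json:` scheme with its `JsonSerializer` are hypothetical stand-ins chosen so that the example runs without generated protobuf classes.

```python
import json
from typing import Any, Callable, Dict


class JsonSerializer:
    """Stand-in serializer: converts values to bytes and back using JSON."""

    def serialize(self, value: Any) -> bytes:
        return json.dumps(value).encode("utf-8")

    def deserialize(self, data: bytes) -> Any:
        return json.loads(data.decode("utf-8"))


# Hypothetical registry: URI scheme -> factory that creates a serializer
# for the type named in the rest of the URI.
FACTORIES: Dict[str, Callable[[str], JsonSerializer]] = {
    "json": lambda type_name: JsonSerializer(),
}


def serializer_for(scheme_uri: str) -> JsonSerializer:
    """Select a factory by the URI scheme and let it build the serializer."""
    scheme, _, type_name = scheme_uri.partition(":")
    factory = FACTORIES.get(scheme)
    if factory is None:
        raise ValueError("no serializer factory for scheme: " + scheme)
    return factory(type_name)


# Round trip: object -> bytes -> object.
serializer = serializer_for("json:example.UserDetails")
payload = serializer.serialize({"name": "me"})
restored = serializer.deserialize(payload)
```

The part of the URI before the colon selects the factory, and the remainder names the type the serializer should handle, in the same way that `proto:` is followed by the name of a generated class in the configuration above.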

Wildcard attributes #

Attribute event.* of entity user is an example of a wildcard attribute. Such an attribute can be viewed as a collection of key-value pairs; that is, multiple instances of attribute event.* may exist. The asterisk (*) stands for the suffix of the wildcard attribute, which can hold string data identifying the specific instance of the attribute. Examples are attributes event.640ab744-3b3e-11ed-936b-e5b6cd08b011 and event.719aadec-3b3e-11ed-936b-e5b6cd08b011, which point to entity event with keys 640ab744-3b3e-11ed-936b-e5b6cd08b011 and 719aadec-3b3e-11ed-936b-e5b6cd08b011, respectively. Wildcard attributes are useful to represent:

  • lists (iterables)
  • maps
  • relations

We will see examples of all these uses throughout this book.
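To make the prefix and suffix relationship concrete, the following Python sketch extracts the suffix from an attribute name and uses it to view a set of wildcard instances as a map. The `wildcard_suffix` helper and the sample data are hypothetical and only illustrate the naming convention described above.

```python
from typing import Dict, Optional


def wildcard_suffix(wildcard: str, attribute: str) -> Optional[str]:
    """Return the suffix if `attribute` is an instance of `wildcard`, else None.

    `wildcard` is a name such as "event.*"; everything before the asterisk
    is treated as a fixed prefix.
    """
    if not wildcard.endswith(".*"):
        raise ValueError("not a wildcard attribute name: " + wildcard)
    prefix = wildcard[:-1]  # "event.*" -> "event."
    if attribute.startswith(prefix) and len(attribute) > len(prefix):
        return attribute[len(prefix):]
    return None


# Attributes of one (made-up) user, keyed by full attribute name.
user_attributes: Dict[str, str] = {
    "details": "user details",
    "event.640ab744-3b3e-11ed-936b-e5b6cd08b011": "first event",
    "event.719aadec-3b3e-11ed-936b-e5b6cd08b011": "second event",
}

# View all instances of "event.*" as a map: suffix -> value.
events: Dict[str, str] = {}
for name, value in user_attributes.items():
    suffix = wildcard_suffix("event.*", name)
    if suffix is not None:
        events[suffix] = value
```

Here the suffix plays the role of a map key; when the suffix is the key of another entity, as in this example, the same structure expresses a relation.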

StreamElement #

The platform handles all data as data streams consisting of upserts and deletions of data. Each upsert or delete is an immutable event describing that a new data element was added, updated or removed. Every StreamElement consists of the following parts:

| entity | attribute | key | timestamp | value | delete wildcard flag |

The entity is represented by an EntityDescriptor. The attribute is represented by an AttributeDescriptor together with its name; the name is needed especially for wildcard attributes, because the name of the attribute in the descriptor does not identify a specific instance of the attribute.

The key is the key of the entity (representing a “row”), the timestamp is an epoch timestamp in milliseconds marking the instant at which the particular change (upsert or delete) happened, and the value is the new value (for upserts) or null (for deletes).

The delete wildcard flag marks a special delete event that deletes all instances of a wildcard attribute.
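The parts of a stream element can be modeled as a small immutable record. The Python dataclass below is a conceptual sketch only; it is not the platform's StreamElement class, and its field and property names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: every stream element is an immutable event
class StreamElement:
    entity: str              # name of the entity, e.g. "user"
    attribute: str           # attribute name, e.g. "details" or "event.<suffix>"
    key: str                 # key of the entity instance (the "row")
    timestamp: int           # epoch milliseconds of the change
    value: Optional[bytes]   # serialized new value, or None for deletes
    delete_wildcard: bool = False  # True: delete all instances of a wildcard

    @property
    def is_delete(self) -> bool:
        """A delete (including a wildcard delete) carries no value."""
        return self.value is None


upsert = StreamElement("user", "details", "me", 1234567890500, b"payload")
delete = StreamElement("user", "details", "other", 1234567890300, None)
delete_all = StreamElement(
    "product", "category.*", "book", 1234567890900, None, delete_wildcard=True
)
```

Making the record frozen mirrors the statement that each upsert or delete is an immutable event: a change is expressed by appending a new element, never by modifying an existing one.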

A stream is an unbounded, generally unordered sequence of stream elements. For example, consider a part of a stream consisting of the following elements:

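| entity  | attribute   | key     | timestamp     | value | delete wildcard flag | type            |
|---------|-------------|---------|---------------|-------|----------------------|-----------------|
| user    | details     | me      | 1234567890500 | ….    | false                | upsert          |
| user    | preferences | you     | 1234567890400 | ….    | false                | upsert          |
| event   | data        | ${UUID} | 1234567890900 | ….    | false                | upsert          |
| user    | details     | other   | 1234567890300 | null  | false                | delete          |
| product | category.*  | book    | 1234567890900 | null  | true                 | delete wildcard |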

Such a stream would represent the following events, in the same order as in the table above:

  • insert new details of entity user, key me with given value at timestamp 1234567890500
  • insert new preferences of entity user, key you with given value at timestamp 1234567890400
  • insert new event, attribute data, with key of given UUID and given value at 1234567890900
  • delete details of user other at 1234567890300
  • delete all attributes category.* from entity product, key book at 1234567890900

Stream-table duality #

Stream-table duality is a technique of converting streams of upserts and deletes into a table view. This is essential for the platform, as it defines how to reduce a stream to a snapshot at a given timestamp. A snapshot of a stream at time T is a collection of all stream elements that were written with timestamp <= T. The reduction compacts the stream: duplicates (updates of the same attribute) are resolved so that only the most recent change is kept (or the attribute is deleted, if the most recent element is a delete).

Let’s demonstrate this on the same stream we used in the previous section. Suppose that we start with the following snapshot:

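| entity  | attribute      | key   | timestamp     | value |
|---------|----------------|-------|---------------|-------|
| user    | details        | other | 1234567890000 | ….    |
| product | details        | car   | 1234567880100 | ….    |
| product | category.books | book  | 1234567870000 | ….    |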

Applying our previous stream to this snapshot, we get:

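| entity  | attribute   | key     | timestamp     | value |
|---------|-------------|---------|---------------|-------|
| product | details     | car     | 1234567880100 | ….    |
| user    | details     | me      | 1234567890500 | ….    |
| user    | preferences | you     | 1234567890400 | ….    |
| event   | data        | ${UUID} | 1234567890900 | ….    |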
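The reduction described in this section can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the platform's implementation: `Element` and `reduce_to_snapshot` are hypothetical names, values are plain placeholder strings, the UUID is replaced by a placeholder key, and ties between a wildcard delete and an upsert with the same timestamp are resolved in favor of the delete.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Element(NamedTuple):
    entity: str
    attribute: str
    key: str
    timestamp: int
    value: Optional[str]           # None marks a delete
    delete_wildcard: bool = False


Cell = Tuple[str, str, str]        # (entity, key, attribute)


def reduce_to_snapshot(elements: List[Element], at: int) -> Dict[Cell, Element]:
    """Compact all elements with timestamp <= at, keeping the latest per cell."""
    latest: Dict[Cell, Element] = {}
    wildcard_deletes: List[Element] = []
    for el in elements:
        if el.timestamp > at:
            continue               # not part of the snapshot at time `at`
        if el.delete_wildcard:
            wildcard_deletes.append(el)
            continue
        cell = (el.entity, el.key, el.attribute)
        current = latest.get(cell)
        # Compare timestamps, not arrival order: the stream may be unordered.
        if current is None or el.timestamp >= current.timestamp:
            latest[cell] = el

    def deleted_by_wildcard(el: Element) -> bool:
        # "category.*" covers every attribute starting with "category."
        return any(
            d.entity == el.entity
            and d.key == el.key
            and el.attribute.startswith(d.attribute[:-1])
            and d.timestamp >= el.timestamp
            for d in wildcard_deletes
        )

    return {
        cell: el
        for cell, el in latest.items()
        if el.value is not None and not deleted_by_wildcard(el)
    }


initial = [
    Element("user", "details", "other", 1234567890000, "..."),
    Element("product", "details", "car", 1234567880100, "..."),
    Element("product", "category.books", "book", 1234567870000, "..."),
]
stream = [
    Element("user", "details", "me", 1234567890500, "..."),
    Element("user", "preferences", "you", 1234567890400, "..."),
    Element("event", "data", "uuid-placeholder", 1234567890900, "..."),
    Element("user", "details", "other", 1234567890300, None),
    Element("product", "category.*", "book", 1234567890900, None, True),
]

snapshot = reduce_to_snapshot(initial + stream, at=1234567890900)
for entity, key, attribute in sorted(snapshot):
    print(entity, attribute, key, snapshot[(entity, key, attribute)].timestamp)
```

Because the function compares timestamps rather than positions in the list, it yields the same snapshot regardless of the order in which the elements arrive. Passing a smaller `at`, for example 1234567890000, returns only the three rows of the initial snapshot.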

In the following chapter, we will use these concepts to see how the Proxima platform maps streams and snapshots to different storages.