2021-06-16

2021-06-18

dali index serialization format

Index and transaction log segments could define value types once and the rows would then contain values without type tags.

Optimizing indexes for storage space and read performance should be separated from the application logic. Application could accept arbitrary precision integers but when an index segment is written to the disk it could be noted that using U16 for all the numbers without type tags is the optimal way of serializing the values in that segment.

Strings could use VLQ:s to define the length of the string.

how should dates be stored?

Record types consisting of primitive types could be described with statements. Each record type and attributes in it would be represented with a global entity ids.

One date format could be defined as:

[:year i16
 :month u8
 :day u8
 :date [:year :month :day]]

abstract type level record type definitions

Record types could be described with abstract types without specifying the precies serialization format. For example a date could be defined as:

[:year integer
 :month integer
 :day integer]

how to express sequences in a graph database?

There should be a tuple value type. Abstract tuple type could have arbitrary number of heterogeneous values. Concrete tuple type could fix the tuple length and value types.

enumeration value types

An enumeration value type could define a primitieve integer type as the storage format and entity ids for each value. Each entity describes how the corresponding number should be interpreted.

Operator in a statement would be one example of an enumeration.

goals of dali

A data structure that provides space and performance efficient way of sharing large amounts of structured information between information systems in a near real time manner.

The data should describe itself in a way that humans can understand it. It should be possible to generate API:s for a given application specific schema for any programming language using the native data types of that programming language. Dynamic programming languages should be automatically be able to provide access to given dali data structure without code generation.

Data should be modelled in a way that combining data from different systems should be natural. Entity ids and attributes in the schemas should be globally unique.

Given urls to multiple dali data structures it should be possible to present a user friendly interface for browsing and querying the data as one big database. The data required for running the queries should be cacheable near the user.

The data structure sould allow distributing it efficiently using a peer to peer protocol.

It should be possible to document schemas in language agnostic way. Attributes could have many names and describtions in different languages.

Schemass should be optional. A dali datastructure should be able to store semistructured data. If the data has regular structure it would only mean that the data can be stored more space efficiently on disk, but if there are exceptions in the structure the serialization format would automatically change to a more verbose one that supports such irregularities.

how should schemas be distributed?

Data refers to attributes with entity ids. How could a data reader retrieve metadata about those attributes to represent the data in a human readable format?

The data segments could include urls of dali structures that contain the metadata.
The metadata could be included in the stream.
There could be global DNS like system that maintains mapping betwen origins and their URL:s.

Since schemas change slowly, they are highly cacheable and they could be shared in a peer to peer manner.

should origins be identified with strings or uuids?

If origin identifiers would follow domain name like hierarchical structure, it would be possible to share responsibility for mapping origin to their urls like in DNS.

DNS itself could be used to hold records that describe where a given origin can be accessed.

Using domain names as origin identifiers would allow using SSL certificates to sign data segments to ensure their authenticity.

schema versions

Origin id and transaction number form a unique identifier of a schema version described in that origin.

there should be a plain text format for dali data for debugging purposes

→ Variable length quantity

← 2021-06-21

← 2021-08-24

← 2021-08-26

← byte size prefixed values

← identifier number size does not matter for storage space and readability

← timestamps as offsets

This site is generated with zetgen