Motivation for app telemetry

Modern web applications perform a lot of tasks behind the scenes. They gather data from users, store it in a repository, and serve it back to the same or different users. Most apps operate and store data like a state machine: whenever a user sends data to the app, the data gets stored, certain calculations are executed, and further modifications are made to the database. Each such action alters the database as a whole, creating a new state. Reverting to a previous state is often difficult, and the sequence of changes is forgotten. The advantage of this approach is the immediate availability of the current state and the minimized amount of stored data, which is typically what you want.

In such projects, it’s common to need not just the basic functionality but also statistical insights. While certain statistics can be derived from the stored data, much of the underlying information has already been lost. With the state machine model, not all transitions are retained, which limits analysis to basic calculations over the current state, such as sums.


Basic idea

Let us store an event at every transition we might possibly want to measure. The structure of the data will differ for every event, because we need to capture all the relevant data available at that moment. This will let us derive most statistical insights later.

Here is a little example. Our app manages the house construction process, and we follow the client from the very first contact. At some point, the customer requests an appointment, then the appointment takes place, then the contract is signed, and at the end, the contract is finished.
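
Mapped to event names in the style used later in this article (ContractFinished is an assumed name for the last step), the flow could look like this:

ClientAdded → AppointmentRequestAdded → AppointmentAdded → ContractSigned → ContractFinished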


Why use MariaDB

Storing such a variety of data is usually a task for a NoSQL database. One of the most common technologies used for this purpose is the Elastic Stack, which stores the data in a document store and can show quick stats in Kibana. Yet setting up the whole Elastic Stack and keeping it up to date is not trivial.

Many projects use MySQL or MariaDB as the primary data store. It would be convenient to use it for telemetric data as well, but a fixed relational schema is not much help here, because the data varies from event to event. This is where the newer JSON features come in, effectively combining the advantages of both worlds, SQL and NoSQL. You can query data inside the JSON, and you can alter the data in the JSON: add new properties, delete and update old ones. It is almost as if you had a document store.
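
As a minimal sketch of these features (snp_event is the event table introduced below; the values are illustrative):

-- read a value from inside the JSON
SELECT JSON_VALUE(data, '$.phone') FROM snp_event;

-- add or update a property inside the JSON
UPDATE snp_event SET data = JSON_SET(data, '$.utm', 'newsletter') WHERE id = 1;

-- delete a property from the JSON
UPDATE snp_event SET data = JSON_REMOVE(data, '$.utm') WHERE id = 1;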

There is also a secondary advantage here: we can store the telemetric data in one transaction with the primary task.
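
A minimal sketch of such a transaction (the contract table and its columns are illustrative; snp_event is defined in the next section):

START TRANSACTION;
-- the primary task
UPDATE contract SET state = 'signed' WHERE id = 86043;
-- the telemetric event is committed or rolled back together with it
INSERT INTO snp_event (name, time, gid, data)
  VALUES ('ContractSigned', NOW(), UUID(), '{"contractId": 86043}');
COMMIT;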

Structure of stored data

To store an event in the database, we create a table called snp_event. The table has five columns: id, name, time, gid, and data.

CREATE TABLE `snp_event` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(128),
  `time` datetime,
  `gid` varchar(128),
  `data` mediumtext,
  PRIMARY KEY (`id`),
  KEY `INDEX_1` (`name`),
  KEY `INDEX_2` (`time`),
  KEY `INDEX_3` (`gid`)
) ENGINE=InnoDB;

There is nothing special about these columns:

  • id - auto increment key of the event
  • name - short name of the event, e.g. ClientAdded
  • time - date and time of the event
  • gid - global identifier of the event, e.g. 0f20307b-78aa-403a-9a2e-897a6e336f60; it can be shared when an event has multiple parts
  • data - text column holding the JSON-structured data of the event. Because JSON columns in MariaDB are actually text columns, the type is mediumtext.

More interesting is what we actually write to the JSON data column. The JSON format lets you store an arbitrary tree structure, but I would recommend keeping the nesting to one level only: you still take advantage of the variability of the data while keeping the complexity of the data and the queries as low as possible.

Data stored in JSON can be divided into three groups:

  • measured properties
    They can hold whatever you need for your statistical insights. When sensitive information occurs, I use a hash function; this way you can still perform joins with other data while keeping the sensitive information secret. This is also done with regard to a possible future transition to Elasticsearch.
  • duplication of other columns (gid, name, time)
    I recommend duplicating the data stored outside the JSON in case you move the data to Elasticsearch in the future.
  • ids of entities involved in the event
    I keep the ids so I can easily add new measured properties when the need arises.

Here is an example of the data stored for an AppointmentAdded event:

{
  // measured properties
  "phone":"ee439e093b3e2505a1aa85884ad840014858baa0818bdb932c7b4c5097867413",
  "email":"c056c2f7da5f6904c5411a37823b3fe6bc7ef45e0735d95fca416c96d07c1a65",
  "company":1,
  "numberOfAppointmentsInStack":12,

  // data from other columns
  "id":"44ae2d9e-9b38-11ee-995f-e6889ca5fb49",
  "event":"AppointmentAdded",
  "time":"2023-12-15 11:54:06",

  // ids of entities involved in the event
  "appointmentId":86043,
  "userId":518,
  "managerId":178,
  "officeId":49
}
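
Writing such an event is then a single INSERT. Here is a minimal sketch; the raw phone and email values are illustrative, and SHA2(..., 256) is assumed for the hashes (matching the 64-character digests above):

SET @gid = UUID();
SET @now = NOW();

INSERT INTO snp_event (name, time, gid, data)
  VALUES (
    'AppointmentAdded',
    @now,
    @gid,
    JSON_OBJECT(
      -- measured properties (sensitive values hashed)
      'phone', SHA2('+420123456789', 256),
      'email', SHA2('client@example.com', 256),
      'company', 1,
      'numberOfAppointmentsInStack', 12,
      -- duplication of other columns
      'id', @gid,
      'event', 'AppointmentAdded',
      'time', @now,
      -- ids of entities involved in the event
      'appointmentId', 86043,
      'userId', 518,
      'managerId', 178,
      'officeId', 49
    )
  );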

Simple stats built on collected data

Now that we have some data collected in the snp_event table, we can start gathering statistics.

Examples of what you can get from our sample data:

  • compare the count of new appointment requests for the last year with the preceding year
  • compare the count of appointments in the stack with last year’s data
  • show the development of new appointment requests over time for the last year
  • show how many contracts were signed from certain UTM parameters
  • show how many contracts were finished this year (month, week)

Here are some more detailed examples:

1) We can visualize the development of new appointments over time, grouped by week, for a whole year.

-- weekly counts of new appointments for the whole year 2023
SELECT
  WEEK(time) AS week,
  COUNT(*) AS appointments
  FROM snp_event
  WHERE name='AppointmentAdded'
    AND time >= '2023-01-01' AND time < '2024-01-01'
  GROUP BY WEEK(time)
  ORDER BY WEEK(time);
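
In the same spirit, here is a sketch of the first comparison from the list above:

-- compare the counts of new appointment requests in 2022 and 2023
SELECT
  YEAR(time) AS year,
  COUNT(*) AS requests
  FROM snp_event
  WHERE name='AppointmentRequestAdded'
    AND time >= '2022-01-01' AND time < '2024-01-01'
  GROUP BY YEAR(time)
  ORDER BY YEAR(time);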

2) We can get all the UTM sources we collected, together with the number of all such events and the number of those that led to a signed contract, thus measuring the success of every campaign.

-- get the list of all utm sources with their respective
-- counts of all requests vs. truly signed contracts in a given time period
SELECT
  utm,
  SUM(cnt) AS requests,
  SUM(CASE WHEN linked=1 THEN cnt ELSE 0 END) AS contracts
  FROM
  (
    SELECT
      r.utm AS utm,
      COUNT(*) AS cnt,
      (c.phone IS NOT NULL) AS linked
      FROM
      (
        SELECT
          JSON_VALUE(data, '$.phone') AS phone,
          JSON_VALUE(data, '$.email') AS email,
          JSON_VALUE(data, '$.utm') AS utm
          FROM snp_event
          WHERE name='AppointmentRequestAdded'
            AND time >= '2023-01-01' AND time < '2024-01-01'
      ) r
      LEFT JOIN
      (
        SELECT
          JSON_VALUE(data, '$.phone') AS phone,
          JSON_VALUE(data, '$.email') AS email
          FROM snp_event
          WHERE name='ContractSigned'
            AND time >= '2023-01-01' AND time < '2024-01-01'
      ) c
      ON r.phone=c.phone OR r.email=c.email
      GROUP BY r.utm, (c.phone IS NOT NULL)
  ) s
  GROUP BY utm
  ORDER BY utm;

Further extension

Current state using a cron job

We can measure not only increments and decrements at every transition, giving us the ability to show how inputs vary during certain periods of time; we can also add the current state to the measured data and thus show the development of absolute values.

We also don’t have to wait for a transition to happen. Instead, we can run a cron job periodically and create a special event.

{
  // measured properties
  "company":1,
  "numberOfAppointmentsInStack":12,
  "numberOfRunningContracts":12,

  // data from other columns
  "id":"44ae2d9e-9b38-11ee-995f-e6889ca5fb49",
  "event":"StateCaptured",
  "time":"2023-12-15 11:54:06",
}
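
Showing the development of an absolute value is then just a matter of reading the snapshots back:

-- development of the number of running contracts during 2023
SELECT
  time,
  JSON_VALUE(data, '$.numberOfRunningContracts') AS runningContracts
  FROM snp_event
  WHERE name='StateCaptured'
    AND time >= '2023-01-01' AND time < '2024-01-01'
  ORDER BY time;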

Multiple events

We should also discuss the possibility of multiple events. They occur, for example, when several investors sign the same contract. To properly join such an event with other data, it needs to be written more than once, with the respective investor’s phone and email. In this case, the ID of the event (the gid column) remains the same, and we can then group the parts by this ID.

// first event - different phone and email, but the same id, event, time
{
  // measured properties
  "company":1,
  "phone":"ee439e093b3e2505a1aa85884ad840014858baa0818bdb932c7b4c5097867413", // differs from second
  "email":"c056c2f7da5f6904c5411a37823b3fe6bc7ef45e0735d95fca416c96d07c1a65", // differs from second

  // data from other columns
  "id":"44ae2d9e-9b38-11ee-995f-e6889ca5fb49", // the same id of the event
  "event":"ContractSigned",
  "time":"2023-12-15 11:54:06",
}

// second event - different phone and email, but the same id, event, time
{
  // measured properties
  "company":1,
  "phone":"e40550a8238a8815b479080eb49d488e5b81b4c9647a3bd38a39ea5ac1310072", // differs from first
  "email":"ab165d4c4faf750ea726710cbffa7337c9563e96c316c55c4f1d8095cde42506", // differs from first

  // data from other columns
  "id":"44ae2d9e-9b38-11ee-995f-e6889ca5fb49", // the same id of the event
  "event":"ContractSigned",
  "time":"2023-12-15 11:54:06",
}
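
Thanks to the shared identifier, such multi-part events can still be counted only once:

-- count signed contracts once, even when stored in multiple parts
SELECT COUNT(DISTINCT gid) AS signedContracts
  FROM snp_event
  WHERE name='ContractSigned'
    AND time >= '2023-01-01' AND time < '2024-01-01';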

Summary

We have shown the principles of basic app telemetry using the JSON columns of the MariaDB relational database. The approach looks promising, yet we haven’t benchmarked it on huge amounts of data. It seems to be good practice to narrow down the rows involved in the statistics using the classic columns like name and time, backed by properly set indexes on the table, before diving into joins using the JSON_VALUE function.

What I’ve found appealing is the fact that you store all your statistics in one database table. You can then get statistical insights into the development of increments as well as absolute values, and you can apply a variety of filters and groupings.

Another intriguing fact is that the event storage is independent of your actual data, giving us the possibility to move it to a real document store like Elasticsearch in the future.