The data architecture that works 99% of the time

The exceptions, the roughly 1% of systems this architecture is not built for, are specialized workloads such as:
  1. Real-time financial applications
  2. Video processing

The Components

The Front-End
The “front-end” serves as the router for the rest of your data pipeline. It is scalable and asynchronous, which means it can serve thousands of potential data sources: sensors, user connections, and so on. Even though Kafka and other message brokers have producer APIs, I don’t like the idea of producers accessing the message bus directly; several complications arise from that. A front-end also makes it possible for an individual user web session to be a producer in your data pipeline. By having a front-end, you can enforce specific standards and reject a producer that doesn’t meet them. For this I like Tornado, but there are other options.

What it does:

  • Producers push data to the front-end via a web-service call; the front-end never “fetches” data itself.
  • It “wraps” the data in an envelope that every agent, worker, etc. in your system can understand (example below).
{
  "data_id": "fdc7d2f2-062c-4683-9a41-1d7912723c1b",
  "source": "generator-10",
  "created": "2020-05-05 00:11:11",
  "data": "UG93ZXIgZ2Vu.."
}
  • The front-end also gives you a degree of security control, because it is the only point in your data pipeline that is accessible to the outside world. You can add additional layers of security here, such as authenticating each client.
  • It places the data in the stream. Every item goes into the stream (a minimal sketch of such a front-end follows this list).
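A minimal sketch of such a front-end, assuming Tornado and the kafka-python package; the EnvelopeHandler name, the /ingest route, and the "raw-data" topic are illustrative choices, not part of the original design.

import json
import uuid
from datetime import datetime, timezone

import tornado.ioloop
import tornado.web
from kafka import KafkaProducer

# Single shared producer; every accepted item goes into the stream.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

class EnvelopeHandler(tornado.web.RequestHandler):
    async def post(self):
        payload = json.loads(self.request.body)
        # Enforce your standards here and reject producers that don't meet them.
        if "source" not in payload or "data" not in payload:
            self.set_status(400)
            self.write({"error": "'source' and 'data' fields are required"})
            return
        # Wrap the raw data in the standard envelope shown above.
        envelope = {
            "data_id": str(uuid.uuid4()),
            "source": payload["source"],
            "created": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
            "data": payload["data"],
        }
        producer.send("raw-data", envelope)
        self.write({"data_id": envelope["data_id"]})

def make_app():
    return tornado.web.Application([(r"/ingest", EnvelopeHandler)])

if __name__ == "__main__":
    make_app().listen(8888)
    tornado.ioloop.IOLoop.current().start()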

The Stream

The stream serves as a source that multiple real-time applications can consume and process concurrently.

What it does

  • It provides multiple channels for the data routed from your front-end. For example, your demand data go into one Kafka topic, “demand”, and your generator data go into another, “generation”.
  • The stream works for both input and output. For example, your demand data are consumed by a machine-learning agent, and its predictions are piped into another topic, “demand-prediction”, which is consumed by another agent that sends messages to your generation station (a sketch of this pattern follows this list).
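A minimal sketch of that consume-predict-produce pattern, assuming the kafka-python package; predict_demand() stands in for whatever model you actually run.

import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "demand",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def predict_demand(data):
    # Placeholder for your machine-learning model.
    return {"forecast": data}

for message in consumer:
    envelope = message.value
    prediction = predict_demand(envelope["data"])
    # Re-wrap the result so downstream agents see the same envelope format.
    producer.send("demand-prediction", {**envelope, "data": prediction})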

Agents

Here’s where this entire thing gets interesting. Because you have standardized all of your data, any agent can digest any message and do something with it. You can have one agent for every topic, or a hundred. An agent can be a complex application or a simple lambda function.

Some properties:

  • Each agent has a unique name or ID: You decide the naming convention, but it’s important that you can identify agents uniquely. Bonus points if you incorporate versions into your agent names.
  • Agents are single-purpose: Each agent should perform a single task on a single data type coming from your stream. Don’t multi-purpose your agents; they’re cheap to deploy. That way, when an input data type changes, you don’t have to parse and update all of your agents; you go directly to the agents that operate on that data type. This matters once your agents number in the tens or hundreds.
  • They’re mostly stateless: If an agent fails, you can just spin up another one that does the exact same thing, so with the right monitoring your system becomes extremely fault tolerant. I say mostly stateless because occasionally an agent collects data for a bit before performing its calculation, say a 20-period moving average.
  • They only store data if they create data: An agent stores data only if it produces something unique. That is, don’t have the RealtimeDashboardAgent store the raw message; create a separate S3StorageAgent that stores all records in partitioned S3 buckets.

What they do

  • Deserialize the data into something the agent understands, then perform its single task on it (a minimal sketch follows).
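A minimal sketch of a single-purpose agent, assuming the kafka-python package; the MovingAverageAgent name, the "demand" topic, and the 20-period window are illustrative, echoing the moving-average example above.

import json
from collections import deque
from kafka import KafkaConsumer, KafkaProducer

class MovingAverageAgent:
    name = "moving-average-agent-v1"  # unique, versioned agent name

    def __init__(self, window=20):
        self.window = deque(maxlen=window)
        self.consumer = KafkaConsumer(
            "demand",
            bootstrap_servers="localhost:9092",
            value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        )
        self.producer = KafkaProducer(
            bootstrap_servers="localhost:9092",
            value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        )

    def run(self):
        for message in self.consumer:
            envelope = message.value            # deserialize the envelope
            # Assumes the payload is a numeric reading rather than base64.
            self.window.append(float(envelope["data"]))
            if len(self.window) == self.window.maxlen:
                average = sum(self.window) / len(self.window)
                # The agent created new data, so it publishes it downstream.
                self.producer.send(
                    "demand-moving-average",
                    {**envelope, "data": average, "source": self.name},
                )

if __name__ == "__main__":
    MovingAverageAgent().run()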

Storage

This is highly dependent on the scale and maturity of your organization. Do you have data scientists who need to query the data regularly? Do you need to store everything, or just aggregated data? One common starting point is the S3StorageAgent mentioned above; a sketch of its write path follows.
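A minimal sketch of such an S3StorageAgent's write path, assuming boto3; the bucket name and the Hive-style partition layout (year=/month=/day=) are assumptions chosen to make the data easy to query later.

import json
from datetime import datetime

import boto3

s3 = boto3.client("s3")
BUCKET = "my-pipeline-raw-data"  # assumed bucket name

def store_envelope(envelope):
    # Partition by the envelope's creation date so queries can prune by day.
    created = datetime.fromisoformat(envelope["created"])
    key = (
        f"raw/year={created:%Y}/month={created:%m}/day={created:%d}/"
        f"{envelope['data_id']}.json"
    )
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(envelope).encode("utf-8"))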

Trace

Your trace cache should be a single source of truth for your agent framework. It is keyed by the “data_id”, the unique identifier attached to every piece of data produced in your system. Because you funnel potentially thousands of producers through a single front-end, you can ensure that the data and identifiers are standardized. I like Redis for this because it gives O(1) guarantees for these operations, but other options exist. Each agent (and the front-end) records a timestamp under the data_id it has touched, for example:

HSET fdc7d2f2-062c-4683-9a41-1d7912723c1b created timestamp
HSET fdc7d2f2-062c-4683-9a41-1d7912723c1b agent_1503 timestamp
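A minimal sketch of that trace cache in Python, assuming the redis-py package; the key layout mirrors the HSET commands above, and agent_1503 is just an example agent name.

import time
import redis

r = redis.Redis(host="localhost", port=6379)

def mark_created(data_id):
    # Called by the front-end when an envelope first enters the pipeline.
    r.hset(data_id, "created", time.time())

def mark_processed(data_id, agent_name):
    # Called by each agent after it finishes with the item.
    r.hset(data_id, agent_name, time.time())

def trace(data_id):
    # O(1)-per-field lookup of everything that has touched this item.
    return r.hgetall(data_id)

# Example: trace("fdc7d2f2-062c-4683-9a41-1d7912723c1b") returns a mapping of
# "created" and each agent name to the timestamp at which it ran.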

Instrumentation

