Architecture of Low-latency Quoting Stream System for Quantitative Trading

High-Level Architecture

The system follows a modular yet monolithic design moving from data ingestion through decoding, calculation, and distribution.

Data Structure

L1 Market Data

While L1 Market Data is the standard for retail platforms and low-frequency strategies; it lacks the granularity required for HFT (High-Frequency Trading).

Typical L1 Payload:

1
2
3
4
5
6
7
8
9
struct L1_Tick {
char symbol[16]; // Fixed-size char array to avoid heap allocation
uint64_t exch_time; // Exchange-generated timestamp (ms)
uint64_t local_time; // Local receipt timestamp for latency monitoring
int64_t bid_price; // Best bid price
int64_t bid_size; // Quantity at best bid
int64_t ask_price; // Best ask price
int64_t ask_size; // Quantity at best ask
};

L2 Market Data

Level 2 Market Data represents the Depth of Book. It is either Market by Order (MBO) or Market by Price (MBP).

In the Shanghai and Shenzhen Stock Exchanges (SSE/SZSE), the Level 2 data is MBO feed. Instead of receiving aggregated price levels, the exchange broadcasts every individual limit order and transaction. Also 20-level deep of the order book is intergrated into the snapshot.

The L2 Market Data types are: Snapshot, Orderbook, Order, and Transaction. Snapshots and orderbooks are typically broadcast at fixed intervals, while orders and transactions are streamed tick-by-tick.

Typical L2 payload:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
constexpr int MAX_SYMBOL_LEN = 16;
constexpr int BOOK_DEPTH = 20;

struct alignas(64) L2_Order {
char symbol[MAX_SYMBOL_LEN];
uint64_t channel_no;
uint64_t seq_num;
uint64_t exch_time;
int64_t price;
int64_t volume;
char side; // Buy/Sell
char ord_type; // Market / Limit / Cancel
};

struct alignas(64) L2_Transaction {
char symbol[MAX_SYMBOL_LEN];
uint64_t channel_no;
uint64_t seq_num;
uint64_t buy_ord_seq;
uint64_t sell_ord_seq;
uint64_t exch_time;
int64_t price;
int64_t volume;
int64_t turnover;
char exec_type; // Filled / Cancelled
char bs_flag; // Outer Buy/Outer Sell
};

struct alignas(64) L2_Orderbook {
char symbol[MAX_SYMBOL_LEN];
uint64_t exch_time;
uint64_t local_time;

// Bids
int64_t bid_price[BOOK_DEPTH];
int64_t bid_volume[BOOK_DEPTH];
int32_t bid_order_count[BOOK_DEPTH];

// Asks
int64_t ask_price[BOOK_DEPTH];
int64_t ask_volume[BOOK_DEPTH];
int32_t ask_order_count[BOOK_DEPTH];
};

struct alignas(64) L2_Snapshot {
char symbol[MAX_SYMBOL_LEN];
uint64_t exch_time;
uint64_t local_time;

int64_t last_price;
int64_t total_volume;
int64_t total_turnover;

int64_t total_bid_volume;
int64_t total_ask_volume;
int64_t weighted_avg_bid_price;
int64_t weighted_avg_ask_price;
};

ML Factors

Matrix of floating values and ternary states.

Components

Quotes Adaptors

The ingestion layer abstracts the complexity of upstream connectivity. The adaptor is loaded at the runtime as a dynamic libarary.

  • QuoteBrokerAdaptors

    The most protocol-intensive component. Upstream sources range from the exchanges’ raw UDP broadcast to TCP-based broker’s APIs. It manages packet sequencing, handling out-of-order, duplicate, or missing.

  • QuoteForwardingAdaptor

    For internal topology, it ingests aggregated or filtered feeds forwarded from other internal network nodes.

  • QuoteReplayAdaptor

    Streams historical data from the DB or a raw PCAP file for backtesting and research.

Stream Decoder & Machine Learning Factor Generator

The Stream Decoder translates raw, fragmented exchange/broker protocols into our internal data structures.

ML Factor Generator is intentional like a black box to the engineering team. The core mathematics remain proprietary to the quantitative team. The generator exposes decoupled interfaces to the quants to inject their models as shared libraries at runtime. The pipeline’s memory layout and IPC routing are specifically optimized to handle two distinct signal types:

  • Regression Outputs: High-precision floating-point values. These typically represent continuous metrics, such as slight deviations from the top-of-book price or real-time fair-value calculations.

  • Classification Outputs: Discrete ternary states (-1, 0, 1). These serve as the directional triggers.

Dispatcher & Sinks

  • IPC sink:

    The critical path for execution, utilizing a Single-Producer, Multiple-Consumer (1-Writer, M-Readers) lock-free shared memory architecture.

    The SHM segment begins with a control header includes the Provider PID and Launch Timestamp. During startup, the Feed Handler performs a mandatory check on the stored PID & launch time. If an active process is detected, the new instance aborts to prevent dual-writer memory corruption. The launch timestamp allows downstream readers to detect provider restarts, signaling them to clear stale state and re-synchronize.

  • LAN sink:

    Handles external distribution. It routes data over the network to other colocation facilities

  • Log sink:

    Asynchronously persists the stream into binary files for post-trade database ingestion.

Further Questions:

The quoting system is a part of the latency arms race. It is unlikely any competitive HFT system will be open sourced.

  1. How can we further reduce latency?

Beyond software optimization, we look at Kernel Bypass (using Solarflare Onload or DPDK) to move networking into user-space, avoiding the overhead of the Linux kernel stack. Additionally, CPU Pinning and isolating cores via isolcpus prevents the OS scheduler from interrupting the critical path. For extreme-low-latency requirements, such as HFT arbitrage, the entire system can be offloaded to FPGA hardware.

  1. Why Shared Memory, and how is data structured within it?

Shared Memory (SHM) allows multiple processes to access the same physical RAM, eliminating the copy-overhead of local sockets and the expensive context switching between kernel and user space. We structure this as a Lock-Free Single-Producer, Multiple-Consumer (SPMC) ring buffer using Atomic Sequence Numbers.

Setup STM32 Dev Environment on Windows Laptop with Virtual Machine

Software Spec

Host OS: Windows 11 Pro 24H2

VMware Workstation 17 Pro (17.5.2 build-23775571)

Guest OS: Ubuntu 24.04 LTS, Debian 12 (Failed), Debian 13(Failed)

1. Setup Guest OS

First, I found no Internet in the guest OS, need to manually turn on the related service on the Windows host.

I had problems with open-vm-tools on Debian 12. The auto resolution, and the sharing folder features does not work. Try to manual install the VMWare tool with inserting ISO, resulted in segment fault when compiling.

Tried same approach with Debian 13 released few days ago, does not work.

Try to install Ubuntu 24 LTS as the guest OS instead. VMWare detected the OS, will use the easy install. open-vm-tools and other features are automatically set up.

2. Setup the Cross Compile Toolchain

Update the system first.

1
apt update && apt upgrade

Install the required packages.

1
apt install vim p7zip-full git make cmake stlink-tools minicom

Download and extract the cross-compile toolchain to the ~/opt/ folder.

1
2
3
4
5
mkdir -p ~/opt && cd ~/opt

wget https://developer.arm.com/-/media/Files/downloads/gnu/14.2.rel1/binrel/arm-gnu-toolchain-14.2.rel1-x86_64-arm-none-eabi.tar.xz

tar xvf arm-gnu-toolchain-14.2.rel1-x86_64-arm-none-eabi.tar.xz

As for the meaning of the suffix arm-none-eabi: ARM is the target architecture, none means no vender, EABI stands for Embedded Application Binary Interface.

Add the path environment to the ~/.bashrc

1
export PATH=$PATH:/home/alice/opt/arm-gnu-toolchain-14.2.rel1-x86_64-arm-none-eabi/bin

Update the environment and verify the installation is successful by printing the version.

1
source ~/.bashrc

3. Setup the Serial Port in the Guest OS

Plug in the STM32 board and link that to the guest OS
![[/images/Setup STM32 Dev Environment on Windows Laptop with Virtual Machine-res/setup_usb_reflex.png]]
The drive file should show up in the /dev device mapping folder as /dev/ACM0

Related read: What is the difference between /dev/ttyUSB and /dev/ttyACM?

4. Setup the STM32CubeIDE

After installed the IDE in the ~/opt folder. Launch the IDE.

I have a 3k high resolution screen on my laptop. And the IDE’s font does not auto-scale with the OS settings.

Go to the Window -> Display Size -> Large to manual change the font setting.

Another bug appears. Some characters become invisible.

Which is solved by disable 3D acceleration

5. Try to Compile

Update the CMake file generated by the STM32CubeIDE

1
2
3
4
5
# Setup compiler settings
set(CMAKE_C_COMPILER /home/alice/opt/arm-gnu-toolchain-14.2.rel1-x86_64-arm-none-eabi/bin/arm-none-eabi-gcc)
set(CMAKE_C_STANDARD 11)
set(CMAKE_C_STANDARD_REQUIRED ON)
set(CMAKE_C_EXTENSIONS ON)

Try to compile with the following commands

1
2
3
4
mkdir -p build
cd build
cmake ../
make

Compile failed

To Be Continued…

Reference

https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html

libev Implementation Analysis

libev is basically an event loop, which watches and dispatches the target events. It includes the following parts:

  • Event Handle
  • IO Multiplexing
  • Timer
  • Event Loop

1. Event Handle ev_watcher

libev uses ev_watcher to monitor and manage various events. Each event type has a corresponding watcher type. Common handles include:

  • ev_io : monitor fd R/W events
  • ev_timer: timer, support both one-time and repeating time
  • ev_signal: signal event, E.g. SIGINT

The above can be referred to as sub-classes of ev_watcher:

1
2
3
4
5
6
7
typedef struct ev_watcher { 
int active;
int pending;
int priority;
void *data;
void (*cb)(struct ev_loop *loop, struct ev_watcher *w, int revents);
} ev_watcher;

An example of registering a readable standard input event:

1
2
3
ev_io stdin_watcher; 
ev_io_init(&stdin_watcher, stdin_cb, STDIN_FILENO, EV_READ);
ev_io_start(loop, &stdin_watcher);

ev_io_init is a macro that expands to the following code, mainly for initializing the watcher, such as setting the callback, fd, events, etc.:

1
2
3
4
5
6
7
8
9
10
11
do {
do {
((ev_watcher *)(void *)((&stdin_watcher)))->active = ((ev_watcher *)(void *)((&stdin_watcher)))->pending = 0;
((ev_watcher *)(void *)(((&stdin_watcher))))->priority = (0);
(((&stdin_watcher)))->cb = ((stdin_cb)), memmove(&((ev_watcher *)(((&stdin_watcher))))->cb, &(((&stdin_watcher)))->cb, sizeof((((&stdin_watcher)))->cb));
} while (0);
do {
((&stdin_watcher))->fd = ((0));
((&stdin_watcher))->events = ((EV_READ)) | EV__IOFDSET;
} while (0);
} while (0);

ev_io_start mainly modifies the anfds and fdchanges arrays. For watchers that are already active, it returns directly. If it is not active, it sets the active and priority and uses the head insertion method to insert it into the linked list of the corresponding fd in anfds, and sets the current watcher as the head:

1
2
3
4
5
6
7
wlist_add(&((loop)->anfds)[fd].head, (WL)w);

static __inline__ void
wlist_add(WL *head, WL elem) {
elem->next = *head;
*head = elem;
}

Insert the fd into the end of the fdchanges array:

1
2
++((loop)->fdchangecnt);
((loop)->fdchanges)[((loop)->fdchangecnt) - 1] = fd;

Set the flag w->events &= ~EV__IOFDSET;

2. IO Multiplexing

libev supports select, poll, epoll

1
2
void (*backend_modify)(struct ev_loop *loop, int fd, int oev, int nev);
void (*backend_poll)(struct ev_loop *loop, ev_tstamp timeout);

Taking epoll as an example, backend_modify is epoll_modify, and backend_poll is epoll_poll. epoll_modify is the epoll_ctl commonly used by everyone. When an event changes, use EPOLL_CTL_MOD, otherwise use EPOLL_CTL_ADD.

1
epoll_ctl(backend_fd, oev && oldmask != nev ? EPOLL_CTL_MOD : EPOLL_CTL_ADD, fd, &ev)

The execution logic of epoll_poll :

  • epoll_wait obtains the list of ready events (loop->epoll_events)
  • For ready events, execute fd_event and put them into the pendings array of the corresponding priority in the loop.
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    static __inline__ void
    fd_event(struct ev_loop *loop, int fd, int revents) {
    ANFD *anfd = ((loop)->anfds) + fd;
    if (__builtin_expect((!!(!anfd->reify)), (1)))
    fd_event_nocheck(loop, fd, revents);
    }

    static __inline__ void
    fd_event_nocheck(struct ev_loop *loop, int fd, int revents) {
    ANFD *anfd = ((loop)->anfds) + fd;
    ev_io *w;
    for (w = (ev_io *)anfd->head; w; w = (ev_io *)((WL)w)->next) {
    int ev = w->events & revents;
    if (ev)
    ev_feed_event(loop, (W)w, ev);
    }
    }

    void __attribute__((__noinline__))
    ev_feed_event(struct ev_loop *loop, void *w, int revents) {
    w_->pending = ++((loop)->pendingcnt)[pri];
    ((loop)->pendings)[pri][w_->pending - 1].w = w_;
    ((loop)->pendings)[pri][w_->pending - 1].events = revents;
    }

3. Timer

The internal implementation of the timer uses a Binary Heap and a Quaternary Heaps (for better cache efficiency).

1
2
3
4
#define DHEAP 4
#define HEAP0 (DHEAP - 1) /* index of first element in heap */
#define HPARENT(k) ((((k) - HEAP0 - 1) / DHEAP) + HEAP0)
#define UPHEAP_DONE(p,k) ((p) == (k))

Note that HEAP0 is used as the first element, that is, the offset. The internal implementation of the heap is upheap and downheap, which are relatively simple and will not be described here. Its structure is:

1
2
3
4
5
6
typedef ev_watcher_time *WT;

typedef struct {
ev_tstamp at;
WT w;
} ANHE;

The heap is sorted according to at, which is the expiration time of the timer. The heap sort is sorted according to the value of at from small to large.

4. Event Loop - ev_loop

1
2
struct ev_loop *loop = EV_DEFAULT;
ev_run(loop, 0);

EV_DEFAULT will call the ev_default_loop function to initialize an ev_loop structure, which mainly initializes the loop structure through loop_init, such as choosing epoll or poll, select. The most critical point: epoll_init occurs when the above loop is created.

ev_run is more complicated and mainly does the following:

  • fd_reify: Traverse the fdchanges array mentioned earlier, take out the linked list from anfds according to the fd, traverse all event linked lists, get all events, and determine whether the latest event is consistent with the previous old event (events). If they are inconsistent, call epoll_modify (or add the event for the first time).
    1
    2
    3
    4
    5
    typedef struct {
    WL head; // event linked list
    unsigned char events;
    // other
    } ANFD;
  • backend_poll: See the implementation of backend_poll above, it will fetch the ready events and put the events into the pendings array, that is, put them at the end of the corresponding priority queue.
  • timer_reify: If the event has not expired, no processing is done. Otherwise, the executes the following:
    • Call ev_timer_stop to clear the timer that has no repeat set.
    • Set the repeat timer with ev_timer_init() call
  • invoke_cb: it is ev_invoke_pending by default, the callback will be take out of from the priority queue then be triggered respectively.

Reference

libev | Github

Quaternary Heaps | University of Waterloo ECE 250

Leetcode 432. All O'one Data Structure

Design a data structure to store the strings’ count with the ability to return the strings with minimum and maximum counts.

Implement the AllOne class:

AllOne() Initializes the object of the data structure.
inc(String key) Increments the count of the string key by 1. If key does not exist in the data structure, insert it with count 1.
dec(String key) Decrements the count of the string key by 1. If the count of key is 0 after the decrement, remove it from the data structure. It is guaranteed that key exists in the data structure before the decrement.
getMaxKey() Returns one of the keys with the maximal count. If no element exists, return an empty string "".
getMinKey() Returns one of the keys with the minimum count. If no element exists, return an empty string "".

Note that each function must run in O(1) average time complexity.

Analysis

This problem requires accessing random value and boundries in constant time.

Hence, a cache replacement algorithm pattern should be used:

  1. Hashing map ensures inc and dec runs in constant time. Its pairs’ value points to another container.
  2. To ensure getting max and min value in constant time, ordered linear data structure such as: std::vector, std::list, should be used. std::list is finnaly chosen because its low cost to insert, delete, & relocate elements.

Solution:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
class AllOne {
public:
struct data_t {
std::string key;
int count;
};

AllOne() {}

inline void inc(std::string key) {
// std::cout << "inc(" << key << ")\n";
if(auto it = store_.find(key); it != store_.end()) {
it->second->count++;

// move the key to the rear side
auto dst = std::next(it->second);
while(dst != ordered_seq_.end() && dst->count < it->second->count) {
dst++;
}
ordered_seq_.splice(dst, ordered_seq_, it->second);
}
else {
ordered_seq_.push_front(data_t{std::move(key), 1});
store_.insert({ordered_seq_.begin()->key, ordered_seq_.begin()});
}

// print_list();
}

inline void dec(std::string key) {
// std::cout << "dec(" << key << ")\n";
auto it = store_.find(key);
if(it->second->count == 1) {
// erase the key
ordered_seq_.erase(it->second);
store_.erase(it);
}
else {
it->second->count--;

if(it->second != ordered_seq_.begin()) {
// move the key to the front side
auto dst = std::next(it->second, -1);
while(dst != ordered_seq_.begin() && dst->count >= it->second->count) {
dst--;
}
if(dst->count >= it->second->count)
ordered_seq_.splice(dst, ordered_seq_, it->second);
}
}

// print_list();
}

inline string getMaxKey() {
if(store_.empty())
return "";
return ordered_seq_.rbegin()->key;

}

inline string getMinKey() {
if(store_.empty())
return "";
return ordered_seq_.begin()->key;
}

// void print_list() {
// for(auto x : ordered_seq_) {
// std::cout << "[" << x.key << "=" << x.count << "] ";
// }
// std::cout << "\n";
// }

std::list<data_t> ordered_seq_;
std::unordered_map<std::string_view, std::list<data_t>::iterator> store_;
};