Data Infrastructure

Provable Completeness: Guaranteeing Zero Data Loss in Trade Collection from Crypto Exchanges

Date

Jan 12, 2026

Author

Dmitry Prokofyev, Koinju CTO

In crypto trading, having complete data is not optional - it is foundational. A single missing data record can distort volumes, P&L, risk exposures, and regulatory reports. Yet, in high-throughput feeds, gaps are inevitable unless reconciled- so robust detection and timely remediation are essential for accurate reporting and control.


The Data Quality Challenge

Collecting complete market data from cryptocurrency exchanges is harder than it sounds. On paper, it's simple: connect to a WebSocket, receive trades, store them. In practice, we wrestle with unreliable connections, massive data volumes (millions of trades per day across hundreds of markets) and the need to deliver correct data to consumers in real-time.

Most data providers operate on a "best effort" basis. They stream data, maybe log disconnections, and hope for the best. When a client asks "do you have all the trades for BTC-USD on January 15th?", the honest answer is often "probably."

At Koinju, we took a different approach. We wanted provable completeness: the ability to mathematically demonstrate that for any given market and time range, we have captured every single trade. No gaps, no missing data, no "probably."

This post focuses on our reconciliation mechanism - the core innovation that enables provable completeness in real-time. We'll walk you through a real scenario, demonstrating how our different system components work together and interact with each other to deliver a robust reconciliation solution.

We use Coinbase as the implementation example, but the principles apply to any exchange that provides sequential trade IDs (e.g. Kraken, Bitstamp). For exchanges without sequential IDs, we use probabilistic candle-based validation (a topic for a future post).

The Scenario: A Network Interruption

Let's start with a real situation. Our WebSocket collector is streaming BTC-USD trades from Coinbase, when a brief network interruption occurs. The trades are inserted in the following order:

INSERT INTO public_data.trade (exchange, market, side, quantity, price, timestamp, trade_id, fill_trade)
VALUES
  ('coinbase', 'BTC-USD', 'sell', 0.00512, 43250.00, '2025-12-16T10:10:43.870', '7175159', false),
  ('coinbase', 'BTC-USD', 'buy',  0.00128, 43251.00, '2025-12-16T10:10:43.872', '7175160', false),
  ('coinbase', 'BTC-USD', 'sell', 0.00256, 43249.50, '2025-12-16T10:10:43.874', '7175161', false),
  -- Network blip: trade 7175162 is LOST
  ('coinbase', 'BTC-USD', 'buy',  0.00064, 43252.00, '2025-12-16T10:10:43.880', '7175163', false)

Coinbase assigns sequential integer IDs to trades. The jump from 7175161 to 7175163 points to a clear issue: trade 7175162 is missing. Our application didn't crash, neither did it log an error- it simply never received that trade.

How do we detect this? How do we fix it? And how do we prove it's fixed?

Step 1: Storing Trades

All trades land in ClickHouse's trade table:

CREATE TABLE public_data.trade
(
    exchange     LowCardinality(String),
    market       LowCardinality(String),
    side         LowCardinality(String),
    quantity     Decimal(76, 20),
    price        Decimal(76, 20),
    timestamp    DateTime64(9, 'UTC'),
    trade_id     String,
    fill_trade   Bool,  -- false=WebSocket, true=REST backfill
    created_at   DateTime64(9, 'UTC') DEFAULT now64(9, 'UTC')
)
ENGINE = MergeTree()
ORDER BY (exchange, market, timestamp)

The fill_trade boolean will become important later as it distinguishes original WebSocket trades from backfilled ones.

At this point, our table has 4 trades, while in reality it should have 5. The gap exists silently.

Step 2: Detecting the Gap (T+5 seconds)

A Refreshable Materialized View runs every 5 seconds, scanning for sequence breaks:

CREATE MATERIALIZED VIEW public_data.sequence_discontinuity_mv
REFRESH EVERY 5 SECOND APPEND
TO public_data.sequence_discontinuity
AS
SELECT * EXCEPT diff_with_start_id
FROM (
    SELECT
        market,
        exchange,
        lagInFrame(trade_id, 1, trade_id - 1) OVER w AS start_id,
        trade_id AS end_id,
        lagInFrame(timestamp, 1, timestamp) OVER w AS start_timestamp,
        timestamp AS end_timestamp,
        multiIf(
            diff_with_start_id > 1, 'gap',
            duplicated_trades = 1, 'duplicate',
            'unknown'
        ) AS type,
        trade_id - start_id AS diff_with_start_id,
        if(diff_with_start_id > 0, diff_with_start_id - 1, 0) AS nb_missing_trades,
        toUInt8(diff_with_start_id = 0) AS duplicated_trades
    FROM (
        SELECT market, exchange, toInt64OrZero(trade_id) AS trade_id, timestamp
        FROM public_data.trade
        WHERE timestamp > now64(9, 'UTC') - INTERVAL 10 MINUTE
    )
    WINDOW w AS (PARTITION BY market, exchange ORDER BY trade_id)
)
WHERE diff_with_start_id != 1

How It Detects Our Gap

The key is the lagInFrame() window function. For each trade, it looks at the previous trade_id in the same market:

When the MV processes trade 7175163, it calculates:

  • Previous trade_id (via lagInFrame): 7175161

  • Current trade_id: 7175163

  • Difference: 7175163 - 7175161 = 2

A difference of 2 means one trade is missing. The MV inserts a row into sequence_discontinuity:

SELECT market, start_id, end_id, type, nb_missing_trades, resolved
FROM public_data.sequence_discontinuity
WHERE market = 'BTC-USD';

-- market  | start_id | end_id  | type | nb_missing_trades | resolved
-- BTC-USD | 7175161  | 7175163 | gap  | 1                 | false

The gap is now detected and recorded without any application code, without any scheduled job, purely within ClickHouse.

The Sequence Discontinuity Table

Detected gaps are stored in a table designed for self-healing:

CREATE TABLE public_data.sequence_discontinuity
(
    market            LowCardinality(String),
    exchange          LowCardinality(String),
    start_id          Int64,            -- trade_id before the gap
    end_id            Int64,            -- trade_id after the gap
    start_timestamp   DateTime64(9, 'UTC'),
    end_timestamp     DateTime64(9, 'UTC'),
    type              LowCardinality(String),  -- 'gap' or 'duplicate'
    nb_missing_trades Int64,
    detected_at       DateTime64(9, 'UTC') DEFAULT now64(9, 'UTC'),
    resolved_at       Nullable(DateTime64(9, 'UTC')) DEFAULT null,
    resolved          Boolean MATERIALIZED resolved_at IS NOT NULL
)
ENGINE = ReplacingMergeTree(resolved)
ORDER BY (exchange, market, start_id, end_id)

The ReplacingMergeTree(resolved) engine is crucial: when we later mark a gap as resolved, ClickHouse will deduplicate and keep only the resolved version.

Step 3: Resolving the Gap (T+6 seconds)

Our cryptovalidator service polls for unresolved gaps:

SELECT * FROM sequence_discontinuity
WHERE type = 'gap' AND NOT

It finds our gap and fetches the missing trade from Coinbase's REST API:

Coinbase returns trade 7175162. The validator inserts it with fill_trade=true:

INSERT INTO public_data.trade
  (exchange, market, side, quantity, price, timestamp, trade_id, fill_trade)
VALUES
  ('coinbase', 'BTC-USD', 'sell', 0.00192, 43250.50, '2025-12-16T10:10:43.876', '7175162', true)

The missing trade is now in our table. But how do we know the gap is truly fixed?

Step 4: Verifying Resolution (T+8 seconds)

A second Refreshable Materialized View continuously checks if gaps have been filled:

CREATE MATERIALIZED VIEW public_data.filled_sequence_discontinuity_mv
REFRESH EVERY 2 SECOND APPEND
TO public_data.sequence_discontinuity
AS
SELECT
    * EXCEPT resolved_at,
    now64(9, 'UTC') AS resolved_at
FROM public_data.sequence_discontinuity sd
WHERE sd.type = 'gap' AND NOT sd.resolved
AND (
    SELECT count(DISTINCT toInt64OrZero(trade_id))
    FROM public_data.trade t
    WHERE t.exchange = sd.exchange
      AND t.market = sd.market
      AND toInt64OrZero(t.trade_id) > sd.start_id
      AND toInt64OrZero(t.trade_id) < sd.end_id
)

This correlated subquery syntax is available in Clickhouse 25.10 and later.

This MV asks: "For each unresolved gap, do all the missing trade_ids now exist?"

For our gap (start_id=7175161, end_id=7175163, nb_missing_trades=1), it counts trades where trade_id > 7175161 AND trade_id < 7175163. It finds one trade (7175162) - which equals nb_missing_trades.

The gap is filled! The MV inserts a new row with resolved_at = now():

Self-Healing via ReplacingMergeTree

Now we have two rows for the same gap:

  • Original: (start_id=7175161, end_id=7175163, resolved=false)

  • New: (start_id=7175161, end_id=7175163, resolved=true, resolved_at='2025-12-16 10:10:52')

Because we use ReplacingMergeTree(resolved) with the same ORDER BY key, ClickHouse keeps the row where resolved=true during background merges.

Query with FINAL to see the deduplicated state:

SELECT market, start_id, end_id, resolved, resolved_at
FROM public_data.sequence_discontinuity FINAL
WHERE market = 'BTC-USD';

-- market  | start_id | end_id  | resolved | resolved_at
-- BTC-USD | 7175161  | 7175163 | true     | 2025-12-16 10:10:52.123456789

The entire cycle,detection, resolution, verification,completed in under 10 seconds, entirely automatically.

Proving Completeness

For any market with sequential trade_ids, we can at any time prove completeness:

SELECT
    market,
    min(toInt64OrZero(trade_id)) AS first_id,
    max(toInt64OrZero(trade_id)) AS last_id,
    uniqExact(trade_id) AS actual_trades,
    last_id - first_id + 1 AS expected_trades,
    actual_trades = expected_trades AS is_complete
FROM public_data.trade
WHERE exchange = 'coinbase'
GROUP BY

If is_complete = true, we have every trade in the sequence. No probabilistic sampling or unreliable volume-candle based on mathematical certainty.

Key Differentiators

Database-Native Gap Detection

Traditional approaches poll for gaps using scheduled jobs in application code. Our approach pushes detection into the database itself:

  • No external scheduler to manage or monitor

  • Scales with ClickHouse, not with application instances

  • Independent of acquisition layerβ€”works regardless of how trades arrive or if validation infrastructure restarts

Self-Healing Architecture

The combination of Refreshable MVs and ReplacingMergeTree creates a system that:

  • Detects its own problems

  • Triggers its own repairs

  • Verifies its own fixes

  • Maintains an audit trail of every gap and resolution

Conclusion

By pushing gap detection into ClickHouse's refreshable materialized views, we've built a reconciliation system that:

  1. Detects sequence breaks within seconds of occurrence

  2. Repairs gaps automatically via REST API backfill

  3. Verifies resolution through continuous monitoring

  4. Proves completeness through sequence validation

This approach decouples verification from acquisition, allowing us to scale and maintain high data quality with minimal operational overhead.

The system works for any exchange with sequential trade IDs. For exchanges without this property, we use probabilistic candle-based validation – a topic for a future post.

In the next post, we'll explore how we aggregate these trades into candles using ClickHouse's aggregation functions, and how clients query this data directly via SQL.

This is Part 1 of a 3-part series on cryptocurrency market data infrastructure. Part 2 will cover storage and aggregation of trades into candles and indexes with ClickHouse. Part 3 will cover data distribution and client access.

Want to see it in action?

Our API provides provable completeness for 25+ exchanges.

  • For developers: Try our API β†’ docs.koinju.io

  • For fund administrators: See how we handle your reconciliation β†’ Book a demo!

  • Questions on your specific use case? β†’ Contact us!

DISCLAIMER:
Koinju is a product of Maarkt, a registered Benchmark administrator ( nΒ° BMR2021000001 ) under the Art. 34 of the "Benchmark" regulation ((EU) 2016/1011), authorized and regulated by the French Financial Markets Authority πŸ‡«πŸ‡·. All information and data available on our Website and related Services is provided for information purposes only, and should not be construed as any kind of advice. The Website, its Content and related Services are provided "as is" and "as available", and do not commit the Company to respond to the User's specific need and/or situation. MAARKT cannot be held responsible for any missing or incorrect information. Data provided on the Website is based on unrelated third-parties' data. MAARKT cannot guarantee neither the accuracy, reliability and completeness of these third-parties' data nor related manipulation risks. You accept all risks associated with the use of the Content on the Website provided by our Services and are therefore fully responsible for such use and the consequences that may result.
DISCLAIMER:
Koinju is a product of Maarkt, a registered Benchmark administrator ( nΒ° BMR2021000001 ) under the Art. 34 of the "Benchmark" regulation ((EU) 2016/1011), authorized and regulated by the French Financial Markets Authority πŸ‡«πŸ‡·. All information and data available on our Website and related Services is provided for information purposes only, and should not be construed as any kind of advice. The Website, its Content and related Services are provided "as is" and "as available", and do not commit the Company to respond to the User's specific need and/or situation. MAARKT cannot be held responsible for any missing or incorrect information. Data provided on the Website is based on unrelated third-parties' data. MAARKT cannot guarantee neither the accuracy, reliability and completeness of these third-parties' data nor related manipulation risks. You accept all risks associated with the use of the Content on the Website provided by our Services and are therefore fully responsible for such use and the consequences that may result.
DISCLAIMER:
Koinju is a product of Maarkt, a registered Benchmark administrator ( nΒ° BMR2021000001 ) under the Art. 34 of the "Benchmark" regulation ((EU) 2016/1011), authorized and regulated by the French Financial Markets Authority πŸ‡«πŸ‡·. All information and data available on our Website and related Services is provided for information purposes only, and should not be construed as any kind of advice. The Website, its Content and related Services are provided "as is" and "as available", and do not commit the Company to respond to the User's specific need and/or situation. MAARKT cannot be held responsible for any missing or incorrect information. Data provided on the Website is based on unrelated third-parties' data. MAARKT cannot guarantee neither the accuracy, reliability and completeness of these third-parties' data nor related manipulation risks. You accept all risks associated with the use of the Content on the Website provided by our Services and are therefore fully responsible for such use and the consequences that may result.