Data Costs: Are You a Victim of Your Own Success?

A high-frequency analytical query running against raw object storage scans the entire historical dataset every cycle. The cost compounds with your data growth while the insight stays identical. This post is the architectural fix.

Olivia Storelli

Jun 29, 2026 · 11 Min read

Even in an era of AI and automation, the most valuable asset we have is judgment. We talk often about saving money in the cloud, but what does that actually mean in practice? This post is about small architectural decisions that compound cost at scale, and the structural fix when they do. If you have any pipelines running today against raw object storage on a fixed cadence, there is a good chance one of them is in here.

A case study in insight unit economics

Imagine the scenario. Your high-traffic e-commerce platform has just survived a record-breaking sales weekend. Traffic peaked, transactions hit an all-time high, and the security team flagged every bot and DDoS attempt cleanly using real-time HTTP log monitoring. The executive suite is celebrating.

Thirty days later, the cloud invoice arrives.

Instead of the economies of scale the hyperscalers all promise, your data infrastructure bill has grown linearly with your analytical data. Worst of all, you are paying ten times more to compute the exact same 15-minute analysis the team has been looking at for months.

This is the classic data platform problem: being penalised by your own operational history. It is not a query problem. It is the direct result of a high-frequency, stateless execution loop sitting on top of a suboptimal storage abstraction. Architectural decisions that look trivial on paper produce orders-of-magnitude differences on the cloud invoice.

Monitoring is a data problem at scale

In enterprise e-commerce, downtime or a security breach is millions in lost revenue and a long PR tail. To catch DDoS attacks, credential stuffing, and bot traffic before they do real damage, you have to monitor real-time HTTP logs continuously.

Most enterprises rely on high-end SIEM (Security Information and Event Management) and SOAR (Security Orchestration, Automation and Response) platforms for security orchestration. The catch is that these tools typically charge by data ingestion volume, which becomes unsustainably expensive at HTTP-log volumes. Ingesting 100% of raw, noisy, unfiltered HTTP logs directly into a SIEM is neither economical nor sensible.

The pipeline approach

To preserve capital, the architectural plan is simple:

Land raw, compressed log data each day into inexpensive object storage (Google Cloud Storage, AWS S3).
Analyse it in near real-time, on a 15-minute cadence, inside a cloud data warehouse.
Push only the high-value, actionable alerts to the expensive SIEM.

To prove the business case quickly and without burning a lot of engineering time, teams almost always set up a quick-and-dirty proof of concept at 10% log volume. That is a fast, cheap way to get stakeholders aligned on the data they want to see and what it means.

Where the convenience pattern ends

To get the proof of concept live in a single afternoon, the path of least resistance is to point data warehouse external tables (Schema-on-Read) directly at the raw storage bucket. No ETL pipelines, no data orchestration, no upfront schema optimisation. At 10% data volume in initial testing, this setup runs exceptionally fast and costs pennies.

So you flip the switch to 100% production volume. Within a few hours, the linear cost curve starts to break the unit economics on every insight the pipeline produces.

The zero-ETL convenience hits its limit for two reasons: every query pays for a full scan of the bucket, and security monitoring has to run frequently. If insights are produced every 15 minutes, the pipeline executes the exact same query 96 times a day. The architecture has no state management layer, so it has no memory of what it has already processed. Every run re-scans every log file that has ever landed in the bucket.

This low-friction convenience behaves the same way across every major hyperscaler:

Google Cloud (BigQuery). External tables over GCS force BigQuery to scan unoptimised, non-columnar files on every run, billing at the full on-demand rate (currently $6.25 per TiB scanned in US multi-region) instead of leveraging compressed, native columnar formats (Google Cloud, 2026).
AWS (Athena). Works on an identical on-demand model, charging $5.00 per TB scanned from S3 (Amazon Web Services, 2026). As historical logs accumulate in the bucket, every full scan reads more data and the per-query charge climbs with it.
Snowflake. External tables bypass the platform’s native micro-partitioning, proprietary sorting, and localised metadata caching (Snowflake, 2026). Queries take significantly longer to execute, forcing virtual warehouses to stay active longer and burn through credit hours.

By coupling a low-overhead onboarding abstraction to a high-frequency production system, you tie the cost of a static, identical insight to a historical dataset that never stops growing.

The math: scaling penalty vs flat-line

The analytical value of a 15-minute window does not change. The threat profile does not change. Yet look at how the unit economics decay when you couple a fixed insight to an unpartitioned, historical dataset. The figures below cover 100 days at a modest 50 GB of new logs per day, 96 runs per day, and a simplified flat rate of $5 per TB scanned:

The wrong way: the stateless external scan

The system runs a full scan of the growing historical storage 96 times a day just to look at the latest drop in the bucket.

Day 1. 50 GB in storage. You scan 4.8 TB/day. Cost: $24/day (1x base cost).
Day 30. 1.5 TB in storage. You scan 144 TB/day. Cost: $720/day (30x base cost).
Day 100. 5 TB in storage. You scan 480 TB/day. Cost: $2,400/day (100x base cost).

By Day 100, you are paying 100x more for the exact same 15-minute alert you produced on Day 1. The infrastructure bill is actively penalising you for your own history.
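The three figures above follow from one line of arithmetic: stored volume, times runs per day, times price. A minimal sketch, assuming decimal units (1 TB = 1,000 GB) and a flat $5 per TB scanned; the constant and function names are illustrative, not part of any vendor API:

```python
# Assumptions from the scenario above: 50 GB of new logs per day,
# one run every 15 minutes, flat $5 per TB scanned (decimal units).
DAILY_INGEST_GB = 50
RUNS_PER_DAY = 24 * 60 // 15  # 96 runs
PRICE_PER_TB = 5.00


def stateless_daily_cost(day: int) -> tuple[float, float]:
    """Return (TB scanned, dollars) for one day of full-history scans."""
    stored_gb = DAILY_INGEST_GB * day  # everything landed so far
    scanned_tb = stored_gb * RUNS_PER_DAY / 1000  # each run re-reads it all
    return scanned_tb, scanned_tb * PRICE_PER_TB


for day in (1, 30, 100):
    tb, cost = stateless_daily_cost(day)
    print(f"Day {day}: {tb:.1f} TB scanned, ${cost:,.2f}/day")
```

Because the stored volume is the only term that changes, the daily cost is a straight multiple of the day number. That is why the Day 100 bill is exactly 100 times the Day 1 bill.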

The right way: stateful incremental design

With a state management layer or strict partition pruning, the compute engine isolates only the new data that arrived in the 15-minute window (about 520MB per run). History is stored for deep audits but is never scanned for operational alerts.

Day 1. 50 GB in storage. You scan 50 GB/day. Cost: $0.25/day (1x base cost).
Day 30. 1.5 TB in storage. You scan 50 GB/day. Cost: $0.25/day (1x base cost).
Day 100. 5 TB in storage. You scan 50 GB/day. Cost: $0.25/day (1x base cost).

On Day 100, the cost is exactly 1x. It is flat and identical to Day 1. The business growth and data history have been successfully decoupled from the cloud compute invoice.

[The wrong way: stateless external scan]
Day 1:    $24/day      (1x)
Day 30:   $720/day     (30x)    Cost is bleeding meaningfully
Day 100:  $2,400/day   (100x)   Cost is a structural problem

[The right way: stateful incremental design]
Day 1:    $0.25/day    (1x)
Day 30:   $0.25/day    (1x)
Day 100:  $0.25/day    (1x)     Flat across the full window

Listing 1. Side-by-side: stateless external scan vs stateful incremental design over 100 days.
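Listing 1 shows daily run rates. The gap is wider still when summed over the whole window. The sketch below is a rough model under the same simplified assumptions (50 GB per day, 96 runs per day, a flat $5 per TB scanned, decimal units). It does not reproduce any vendor's billing logic, and all names in it are illustrative.

```python
DAILY_INGEST_GB = 50
RUNS_PER_DAY = 96
PRICE_PER_TB = 5.00
DAYS = 100


def stateless_cost(day: int) -> float:
    """Every run re-scans all data landed up to and including this day."""
    return DAILY_INGEST_GB * day * RUNS_PER_DAY / 1000 * PRICE_PER_TB


def incremental_cost(day: int) -> float:
    """Each 15-minute slice is scanned once, so history never matters."""
    return DAILY_INGEST_GB / 1000 * PRICE_PER_TB


stateless_total = sum(stateless_cost(d) for d in range(1, DAYS + 1))
incremental_total = sum(incremental_cost(d) for d in range(1, DAYS + 1))
per_run_mb = DAILY_INGEST_GB * 1000 / RUNS_PER_DAY  # size of one slice

print(f"Per-run slice: {per_run_mb:.0f} MB")
print(f"Stateless, {DAYS} days:   ${stateless_total:,.2f}")
print(f"Incremental, {DAYS} days: ${incremental_total:,.2f}")
```

Under these assumptions the cumulative totals come to about $121,200 for the stateless design against $25 for the incremental one. The daily rate grows linearly, so the cumulative bill grows with the square of the elapsed days.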

The hidden multiplier: it is never just one pipeline

If this were the only pipeline running in the enterprise, a $2,400/day bleed would be a painful but isolated leak.

It is never just one.

Modern enterprise data platforms easily run 100+ pipelines concurrently: customer behavioural tracking, transactional updates, inventory syncs, third-party API ingestions. When a data organisation lacks a strict framework for intentional storage design and state management, the quick-and-dirty external-table pattern gets copy-pasted across the estate. It becomes the unwritten standard for speed.

If just 20% of production pipelines carry this exact structural oversight, you are not looking at a minor cloud overage. You are looking at a compounding, multi-million-dollar architectural deficit that quietly erodes company margins behind the scenes.
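To put a rough number on that claim, here is a back-of-the-envelope sketch. It assumes a fleet of exactly 100 pipelines and that every affected pipeline sits at the Day 100 run rate of $2,400/day from the example above. Both are simplifications: real pipelines differ in volume and age, and the run rate itself keeps climbing.

```python
# All three inputs are illustrative assumptions, not measurements.
TOTAL_PIPELINES = 100
AFFECTED_SHARE = 0.20
DAILY_COST_PER_AFFECTED = 2_400.00  # Day 100 rate from the worked example

affected = round(TOTAL_PIPELINES * AFFECTED_SHARE)
daily_exposure = affected * DAILY_COST_PER_AFFECTED
annual_exposure = daily_exposure * 365  # held flat, which understates growth

print(f"Affected pipelines: {affected}")
print(f"Daily exposure:     ${daily_exposure:,.0f}")
print(f"Annualised:         ${annual_exposure:,.0f}")
```

Even with the run rate frozen at Day 100, twenty pipelines at that level come to $48,000 a day, or about $17.5 million a year.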

The CXO lens: system latency vs decision latency

When cloud data invoices spike, the traditional executive response is to reach for blunt accounting instruments. Budget caps, restricted compute privileges, multi-layered approval chains. These tactics are counterproductive. They introduce friction, kill developer velocity, and ultimately cost the business more in lost time and delayed features than they save in cloud credits.

To manage cloud costs without stalling growth, leadership has to understand the structural disconnect between two latencies:

System latency. How fast the technical infrastructure can process a batch of logs and trigger an alert (for example, 10 seconds).
Decision latency. How long it takes a human analyst to see that alert, log into a console, verify the threat, and actually change a firewall rule (for example, 20 minutes).

If the organisation’s human decision latency is 20 minutes, paying an ever-growing financial premium to maintain a 10-second stateless system latency is not a competitive business strategy. It is an expensive vanity project. You are subsidising a hyperscaler’s profit margin.

The leadership pivot: FinOps enablement

The executive job is not to micro-manage queries. It is to foster a culture where engineering teams align technical velocity with unit economics:

Democratise cost visibility. Engineers cannot optimise what they cannot see. Pipe query-cost metrics directly into developer environments or internal communication channels. When an engineer can see that a newly deployed pipeline cost $4,000 to run yesterday, optimisation happens naturally.
Treat cloud efficiency as a core product metric. Cloud compute is a direct variable cost of your product. If an infrastructure team deploys a hyper-fast real-time system that completely erodes product gross margins because of lazy storage abstractions, that deployment is an architectural failure, not a business win.
Brief leaders to look for linear cost. Architects and engineering leaders should investigate anything that presents as a linear cost. Linear in a fixed-output system is a smell.
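One way to put the "linear cost is a smell" check into practice is to compare each pipeline's recent daily cost against its own early baseline. The sketch below is only an illustration. The function names, the 7-day window, and the 2x threshold are all assumptions, and the input is taken to be a per-pipeline list of daily costs exported from whatever billing data you already have.

```python
from statistics import mean


def growth_ratio(daily_costs: list[float], window: int = 7) -> float:
    """Mean cost of the last `window` days divided by the first `window` days."""
    if len(daily_costs) < 2 * window:
        raise ValueError("need at least two full windows of daily costs")
    baseline = mean(daily_costs[:window])
    recent = mean(daily_costs[-window:])
    if baseline == 0:
        return float("inf") if recent > 0 else 1.0
    return recent / baseline


def flag_growing_pipelines(
    costs_by_pipeline: dict[str, list[float]],
    threshold: float = 2.0,  # hypothetical cut-off; tune to your estate
    window: int = 7,
) -> list[str]:
    """Names of pipelines whose cost grew by at least `threshold` times."""
    return sorted(
        name
        for name, costs in costs_by_pipeline.items()
        if growth_ratio(costs, window) >= threshold
    )


# Thirty days of the two designs from the worked example above.
example = {
    "stateless_external_scan": [24.0 * day for day in range(1, 31)],
    "stateful_incremental": [0.25] * 30,
}
print(flag_growing_pipelines(example))
```

With this example data only the stateless pipeline is flagged. Its last-week average is 6.75 times its first-week average, while the incremental pipeline stays at 1.0. A flagged pipeline is a prompt to investigate, not proof of a fault, because cost can also grow when the output itself legitimately grows.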

The architect lens: Medallion isolation strategy

For the system designer, the fix for the proof-of-concept trap is a tailored Medallion/Lambda approach. Instead of pointing high-frequency analytics directly at a raw object storage dump, the system is broken into distinct, isolated operational layers.

The ephemeral Bronze landing zone

Instead of leaving data in object storage and relying on high-overhead Schema-on-Read external tables, execute a targeted, isolated ingestion step. Programmatically isolate the storage search path down to a rolling two-hour file window using dynamic wildcards, then execute a native LOAD DATA OVERWRITE statement into an ephemeral, partitioned, clustered Bronze staging table. This shifts the raw ingestion cost from an O(n) scan over every file that has ever landed to a flat, deterministic overhead bounded by the two-hour window.

The stateful Silver watermark

To persist data cleanly without introducing high-overhead third-party state orchestration tools, let the data warehouse manage state natively. Running a localised MAX() watermark calculation over a strict lookback window (trailing 24 hours, for instance) lets the architecture instantly determine where the previous execution ended. The compute engine processes only the delta, applies inline transformations (parsing nested cookies, cleaning URI parameters), deduplicates records via native window functions, and appends the clean data into the persistent Silver table.

The engineer lens: production implementation

At the implementation layer, the theoretical optimisations turn into concrete BigQuery procedural scripts that break the cost curve.

Bronze window ingestion

Every 15 minutes, this routine runs a dynamic string evaluation to calculate the current and previous hour’s file layout in Google Cloud Storage. It overwrites the staging table, loading only the necessary files into a native, hourly-partitioned columnar environment.

-- Step 1: Isolate the file scan path and execute a native load overwrite
BEGIN
  DECLARE current_hour STRING;
  DECLARE prev_hour STRING;

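  -- FORMAT_TIMESTAMP renders in UTC unless a time zone argument is passed,
  -- so both wildcards follow UTC hour boundaries.
  SET current_hour = FORMAT_TIMESTAMP('gs://enterprise-cdn-logs/%Y%m%dT%H*.log.gz', CURRENT_TIMESTAMP());
  SET prev_hour    = FORMAT_TIMESTAMP('gs://enterprise-cdn-logs/%Y%m%dT%H*.log.gz', TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR));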

  EXECUTE IMMEDIATE FORMAT("""
    LOAD DATA OVERWRITE `your_bronze_project.your_dataset.your_table`
    (
      [__insert schema__]
    )
    PARTITION BY TIMESTAMP_TRUNC(StartTimestamp, HOUR)
    CLUSTER BY [__fields__]
    FROM FILES (
      format = 'JSON',
      uris = ['%s', '%s'],
      compression = 'GZIP',
      ignore_unknown_values = true
    )
  """, current_hour, prev_hour);
END;

Listing 2. Ephemeral Bronze window ingestion: isolate the file-scan path and execute a native LOAD DATA OVERWRITE.
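The wildcard pair that Listing 2 builds can be checked offline before the script is scheduled. The sketch below is a Python mirror of the two FORMAT_TIMESTAMP calls, not part of the pipeline. The function name is hypothetical, the bucket path is the one from Listing 2, and it assumes the log file names are stamped in UTC.

```python
from datetime import datetime, timedelta, timezone

# Same pattern as Listing 2; strftime passes the non-% characters through.
URI_PATTERN = "gs://enterprise-cdn-logs/%Y%m%dT%H*.log.gz"


def bronze_window_uris(now: datetime) -> list[str]:
    """Wildcard URIs for the current and previous UTC hour."""
    if now.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    now_utc = now.astimezone(timezone.utc)
    previous = now_utc - timedelta(hours=1)
    return [now_utc.strftime(URI_PATTERN), previous.strftime(URI_PATTERN)]


# A run just after midnight UTC has to reach back into the previous day.
for uri in bronze_window_uris(datetime(2026, 6, 29, 0, 5, tzinfo=timezone.utc)):
    print(uri)
```

The midnight case is the one worth checking by eye: the previous-hour wildcard must roll back to hour 23 of the prior date, which subtracting an hour from the timestamp handles and editing the formatted string would not.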

Stateful Silver ingestion and deduplication

Once the raw logs are materialised inside the Bronze layer, the pipeline reads the highest timestamp from the persistent Silver table to determine the watermark. Note the deliberate constraint on the watermark lookback: WHERE StartTimestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY). This ensures BigQuery never scans the entire historical Silver table to compute the state. Provided the Silver table is partitioned on StartTimestamp, the filter prunes everything except the most recent partitions, keeping the watermark discovery scan virtually free.

-- Step 2: Calculate the watermark delta, transform, deduplicate, and insert
INSERT INTO `your_silver_project.your_dataset.logs_persistent`
WITH watermark AS (
  -- Bounded lookback keeps the watermark calculation query cheap
  SELECT
    COALESCE(MAX(StartTimestamp), TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR)) AS max_ts
  FROM `your_silver_project.your_dataset.logs_persistent`
  WHERE StartTimestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
)
SELECT
  [__add your schema__]
FROM `your_bronze_project.your_dataset.your_table` AS s
CROSS JOIN watermark
WHERE s.StartTimestamp > watermark.max_ts
-- Inline deduplication filters out late-arriving network duplicates on the fly
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY [your_unique_id], StartTimestamp
  ORDER BY EdgeStartTimestamp DESC
) = 1;

Listing 3. Stateful Silver ingestion: bounded watermark, inline deduplication, append-only insert.
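The selection logic in Listing 3 can be sanity-checked on a handful of rows without touching the warehouse. The sketch below is a plain-Python imitation of the three steps (bounded watermark with fallback, strict greater-than filter, keep-latest deduplication). It is a teaching aid, not the production path. The dictionary keys `id`, `start_ts`, and `edge_ts` are hypothetical stand-ins for the unique ID, StartTimestamp, and EdgeStartTimestamp columns.

```python
from datetime import datetime, timedelta, timezone


def compute_watermark(silver_ts, now, lookback=timedelta(days=1),
                      fallback=timedelta(hours=2)):
    """MAX(start_ts) inside the lookback window, else now minus fallback."""
    recent = [ts for ts in silver_ts if ts >= now - lookback]
    return max(recent) if recent else now - fallback


def select_delta(bronze_rows, watermark):
    """Rows newer than the watermark, one per (id, start_ts), latest edge_ts."""
    best = {}
    for row in bronze_rows:
        if row["start_ts"] <= watermark:
            continue  # at or behind the watermark: treated as already loaded
        key = (row["id"], row["start_ts"])
        if key not in best or row["edge_ts"] > best[key]["edge_ts"]:
            best[key] = row
    return list(best.values())


now = datetime(2026, 6, 29, 12, 0, tzinfo=timezone.utc)
minutes = lambda n: now - timedelta(minutes=n)

watermark = compute_watermark([minutes(30), now - timedelta(days=3)], now)
bronze = [
    {"id": "a", "start_ts": minutes(40), "edge_ts": minutes(40)},  # old
    {"id": "b", "start_ts": minutes(10), "edge_ts": minutes(9)},   # duplicate
    {"id": "b", "start_ts": minutes(10), "edge_ts": minutes(8)},   # kept
    {"id": "c", "start_ts": minutes(5), "edge_ts": minutes(5)},
]
delta = select_delta(bronze, watermark)
print(len(delta), "rows to append")
```

One consequence of the strict watermark is worth knowing before adopting the pattern. A genuinely new row that arrives late, with a StartTimestamp at or below the current watermark, is skipped rather than inserted. Whether that is acceptable depends on how late your log source can deliver.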

The takeaway: trivial code, order-of-magnitude results

This pipeline layout completely isolates the variable cost. Because the system moved from a stateless read over raw files to a bounded, scheduled ingestion window and a stateful insert loop, the bytes scanned per run flat-line.

The next time an engineering or architecture team brings a “minor” infrastructure modification proposal to sprint planning, asking to adjust an ingestion path or isolate a staging layer, remember this code. On paper it looks like a standard code change. In reality it is the structural difference between a flat-line, predictable infrastructure footprint and a 100x compounding financial problem.

This is the kind of work Sakura Sky’s Data & AI practice does inside client estates, and the kind of pattern our Managed Services team runs in production for clients who would rather buy the outcome than staff the rotation.

References

Amazon Web Services, 2026. Amazon Athena pricing. Amazon Web Services. Available at: https://aws.amazon.com/athena/pricing/ [Accessed 29 June 2026].

Google Cloud, 2026. BigQuery pricing. Google Cloud documentation. Available at: https://cloud.google.com/bigquery/pricing [Accessed 29 June 2026].

Snowflake, 2026. External tables. Snowflake documentation. Available at: https://docs.snowflake.com/en/user-guide/tables-external-intro [Accessed 29 June 2026].