CrowdStrike Falcon’s event stream is a powerful source of real-time security telemetry, offering deep visibility into endpoint activity. However, capturing this high-volume “firehose” of data presents a classic engineering challenge: how do you build a data pipeline that is reliable, scalable, and cost-effective without creating an operational burden?
In this post, we’ll walk through a production-ready, serverless solution that does exactly that. We will build a resilient data collector using a Python service hosted on Google Cloud Run, which securely stores every event in Google Cloud Storage (GCS), ready for ingestion into a SIEM like Google Chronicle or for analysis in BigQuery.
By the end, you’ll understand the architecture, the core connection logic, and how to deploy the entire service using best practices like Infrastructure as Code and secure secret management.
Why Build a Custom Pipeline?
Before we dive into the “how,” let’s address the “why.” While third-party data forwarders exist, building your own serverless pipeline offers compelling advantages for organizations invested in Google Cloud:
- Direct GCS Integration: CrowdStrike’s native streaming connectors are primarily built for AWS S3. This solution provides a first-party, direct-to-GCS path, eliminating the need for complex and costly multi-cloud data transfers. You keep your security data within your GCP environment from start to finish.
- Cost-Effective Alternative to Third-Party Tools: Commercial data integration platforms like BindPlane or Cribl are powerful, but they come with licensing costs and add another layer to your architecture. This serverless approach is extremely cost-effective, as you only pay for the minimal Cloud Run compute and GCS storage you actually use.
- Full Control and Customization: Because you own the code, you have complete control. You can easily modify the Python script to add custom in-flight processing—such as filtering out low-value events to reduce SIEM costs, enriching data with internal context, or routing specific event types to different buckets for varied retention policies.
- Simplicity and Security: This architecture is simple and secure. It’s a direct, point-to-point connection between the Falcon API and your GCP environment, reducing your attack surface and eliminating reliance on intermediate vendors.
The Architecture: A Serverless and Resilient Design
Our goal is to create a “set it and forget it” pipeline. We achieve this by combining a few key Google Cloud services:
- Google Cloud Run: A fully managed, serverless platform that runs our containerized Python application. It automatically handles scaling and ensures our service is always running. We don’t need to manage any VMs.
- Google Cloud Storage (GCS): A durable and highly scalable object storage service. It will serve as our data lake, storing the raw .jsonl event files in a partitioned directory structure (YYYY/MM/DD) that’s optimized for downstream processing.
- Google Secret Manager: The secure, central place to store all our configuration, including API keys, bucket names, and URLs. This prevents secrets from ever being hardcoded in our source code.
The logic is simple: a long-running Python service connects to the Falcon stream, batches events, and writes them to GCS. If the service ever fails or the connection drops, Cloud Run automatically restarts it, and our script’s built-in resilience ensures it picks up right where it left off.
The Core Logic: How to Really Connect to the Falcon Stream
Connecting to the Falcon event stream isn’t a simple, single API call. Through testing, we discovered a robust, two-step process that is crucial for a stable connection.
Step 1: Get the Stream Metadata with falconpy
First, we use the crowdstrike-falconpy library to handle authentication and discover the stream’s connection details. This is the only part falconpy is used for.
# Extracted from our main.py - not a complete script
from falconpy import EventStreams

# Authenticate with the OAuth2 client credentials read from the environment
falcon = EventStreams(
    client_id=FALCON_CLIENT_ID,
    client_secret=FALCON_CLIENT_SECRET,
    base_url=FALCON_BASE_URL,
)

# This call authenticates and returns the stream's unique URL and session token
response = falcon.list_available_streams(app_id=APP_ID)
stream_meta = response["body"]["resources"][0]
session_token = stream_meta.get("sessionToken", {}).get("token")
data_feed_url = stream_meta.get("dataFeedURL")
This gives us a unique dataFeedURL and a sessionToken that are valid for a limited time.
Step 2: Consume the Stream with requests
With the metadata in hand, we use the standard requests library to make a persistent connection to the dataFeedURL. The key here is the stream=True parameter.
# Extracted from our main.py - not a complete script
import json
import logging

import requests

headers = {'Authorization': f'Token {session_token}', 'Accept': 'application/json'}
event_batch = []

# stream=True keeps the connection open indefinitely
with requests.get(data_feed_url, headers=headers, stream=True, timeout=360) as resp:
    resp.raise_for_status()
    logging.info("Successfully connected to stream. Processing events...")
    # This loop listens to the open connection and processes lines as they arrive
    for line in resp.iter_lines():
        if line:
            event = json.loads(line)
            event_batch.append(event)
This approach creates a durable, long-lived connection. The for loop will block and wait until the API sends the next event, making it highly efficient.
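The snippet above only appends to event_batch; the excerpt doesn’t show the flush step. Below is a minimal sketch of what that could look like, assuming the google-cloud-storage client library; the flush_batch helper name and the object-naming scheme are our own illustration, not the original main.py.

# Illustrative sketch of the batch flush - helper name and object naming are assumptions
import json
import os
import uuid
from datetime import datetime, timezone

from google.cloud import storage

GCS_BUCKET_NAME = os.environ.get("GCS_BUCKET_NAME")
gcs_bucket = storage.Client().bucket(GCS_BUCKET_NAME)

def flush_batch(event_batch):
    """Write the accumulated events to a YYYY/MM/DD-partitioned .jsonl object."""
    if not event_batch:
        return
    now = datetime.now(timezone.utc)
    blob_name = f"{now:%Y/%m/%d}/falcon-events-{now:%H%M%S}-{uuid.uuid4().hex[:8]}.jsonl"
    payload = "\n".join(json.dumps(event) for event in event_batch)
    gcs_bucket.blob(blob_name).upload_from_string(payload, content_type="application/json")
    event_batch.clear()

In practice you would call a helper like this from the streaming loop whenever the batch reaches a size or age threshold.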
Built-in Resilience
The entire process is wrapped in a while True: loop. If the network connection drops for any reason, the requests.get() call will raise an exception. Our try...except block catches it, waits for 30 seconds, and allows the loop to restart, establishing a brand new, valid connection. This makes the service self-healing.
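Put together, the outer loop looks roughly like the sketch below. The get_stream_metadata and consume_stream helpers are hypothetical names standing in for the two steps shown above, used only to keep the sketch short.

# Illustrative sketch of the self-healing outer loop - helper names are hypothetical
import logging
import time

import requests

while True:
    try:
        # Step 1: re-discover the stream URL and session token (they expire over time)
        session_token, data_feed_url = get_stream_metadata()
        # Step 2: open the long-lived streaming connection and process events
        consume_stream(session_token, data_feed_url)
    except requests.exceptions.RequestException as exc:
        logging.warning("Stream connection lost (%s); reconnecting in 30 seconds", exc)
        time.sleep(30)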
Packaging and Deployment
With the logic defined, we need to package and deploy it.
Containerizing with Docker
The Dockerfile is straightforward. It starts from a slim Python image (python:3.13.0-slim), installs the dependencies from requirements.txt, and uses gunicorn as a production-grade web server. The --timeout 0 flag is critical, as it tells gunicorn that our streaming request is allowed to run forever.
# Extracted from our Dockerfile - not complete
CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app
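One detail the excerpt glosses over is how the main:app that gunicorn serves relates to the streaming loop. A common pattern on Cloud Run is to expose a trivial health-check endpoint and run the stream consumer in a background thread; the sketch below shows that pattern as an assumption about the structure, with stream_forever standing in for the outer loop above.

# Hedged sketch: pairing a minimal WSGI app with the background stream consumer.
# The Flask app, thread, and stream_forever name are assumptions, not the original main.py.
import threading

from flask import Flask

app = Flask(__name__)  # the "app" referenced by gunicorn's main:app

@app.route("/")
def healthcheck():
    # Cloud Run only needs something listening on $PORT; the real work runs in the thread
    return "falcon-streamer is running", 200

# Kick off the long-running consumer loop when the module is imported
threading.Thread(target=stream_forever, daemon=True).start()

For a background consumer like this, Cloud Run should be configured with CPU always allocated and a minimum of one instance, so the thread keeps working between requests.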
Secure Deployment with gcloud
We use the gcloud CLI to deploy our service. The most important part of the deployment command is how we handle configuration. Instead of passing secrets as plain text, we mount them directly from Secret Manager.
# From the deployment instructions
gcloud run deploy falcon-streamer \
  --source . \
  --platform managed \
  --region "us-central1" \
  --update-secrets="FALCON_CLIENT_ID=falcon-client-id:latest" \
  --update-secrets="FALCON_CLIENT_SECRET=falcon-client-secret:latest" \
  --update-secrets="GCS_BUCKET_NAME=gcs-bucket-name:latest" \
  --update-secrets="FALCON_BASE_URL=falcon-base-url:latest"
This command tells Cloud Run to fetch the latest version of each secret and inject it into the running container as an environment variable. Our Python script then reads these variables using os.environ.get(). This is a secure and highly flexible way to manage configuration.
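In the script, that amounts to a few lines at startup, roughly like this sketch (the variable names match the secrets mounted above):

# Reading the secret-backed environment variables at startup (sketch)
import os

FALCON_CLIENT_ID = os.environ.get("FALCON_CLIENT_ID")
FALCON_CLIENT_SECRET = os.environ.get("FALCON_CLIENT_SECRET")
FALCON_BASE_URL = os.environ.get("FALCON_BASE_URL")
GCS_BUCKET_NAME = os.environ.get("GCS_BUCKET_NAME")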
The Outcome: Your Automated Data Pipeline
After deploying, you have a fully automated, serverless pipeline. The service will run continuously, capturing every event from your Falcon stream and writing it to GCS in organized, time-stamped .jsonl files.
Where to from here? You could:
- Set up a Chronicle Feed: Point Chronicle directly at your GCS bucket for automatic, continuous ingestion into your SIEM.
- Analyze with BigQuery: Create an external table in BigQuery on top of your GCS files to run complex SQL queries against your security data.
- Trigger Real-Time Alerts: Configure a GCS trigger to run a Cloud Function on each new file created, allowing you to build custom, real-time alerting logic.
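As a starting point for that last option, here is a hedged sketch of a 2nd-gen Cloud Function (Python) triggered when a new object is finalized in the bucket; the detection logic is a placeholder and the event fields shown are only an example.

# Illustrative sketch of a GCS-triggered Cloud Function - the alert logic is a placeholder
import json

import functions_framework
from google.cloud import storage

@functions_framework.cloud_event
def process_new_events(cloud_event):
    """Runs whenever a new .jsonl file lands in the bucket."""
    data = cloud_event.data
    blob = storage.Client().bucket(data["bucket"]).blob(data["name"])
    for line in blob.download_as_text().splitlines():
        event = json.loads(line)
        # Placeholder: apply your own detection / alerting logic here
        if event.get("metadata", {}).get("eventType") == "DetectionSummaryEvent":
            print(f"Detection event found in {data['name']}")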
This architecture provides a robust foundation for leveraging your Falcon security data across the Google Cloud ecosystem.