Hookbase
Docs
GuideAPI ReferenceIntegrationsUse CasesCLIMCP
Getting StartedSDK ReferencePortal ComponentsAPI Reference
Get Started

Getting Started

IntroductionQuick StartBest PracticesPipeline Architecture

Core Concepts

SourcesDestinationsRoutes

Advanced Features

TransformsAI TransformsFiltersSchemasData WarehousesDeduplicationCustom DomainsTunnelsCron JobsScheduled Sends

Enterprise Security

OverviewCircuit BreakerFailover DestinationsIP FilteringStatic IP DeliveryNotification ChannelsObservability ExportField EncryptionRedaction PoliciesAudit Logs

Kubernetes

Operator GuideCRD ReferenceHelm Chart

Operations

Plans & LimitsProduction ReadinessTroubleshootingTestingComparison
DocsReceiveGuideWarehouse Loading

Loading Events into Data Warehouses

Hookbase delivers webhook events to your data warehouse using object storage as the integration point. A warehouse destination batches events and writes them as JSONL (or JSON) files to Amazon S3, Cloudflare R2, Google Cloud Storage, or Azure Blob Storage. From there, every major warehouse can load or query those files directly.

This guide shows how to get Hookbase events into Snowflake, BigQuery, and Databricks, or query them in place with Amazon Athena and Redshift Spectrum.

Info

Prerequisite: First set up a warehouse destination so events are landing in your bucket. See Destinations → Warehouse Destinations for S3, R2, GCS, and Azure Blob configuration. Warehouse destinations are available on Pro and Business plans.

The pattern

Hookbase owns the extract + load half of the pipeline: it receives webhooks, applies your filters and transforms, batches the results, and writes durable files to object storage. Your warehouse owns the transform + query half. Because the hand-off is plain files in cloud storage, the same setup works with any engine that can read a bucket.

There are two ways to consume the files, and you can mix them:

  • Query in place — define an external table over the bucket prefix. No copy step; the warehouse reads files on demand. Best for ad-hoc analysis and archival.
  • Ingest into native tables — COPY the files into managed tables, either as a scheduled batch or continuously as new files land. Best for dashboards and joins at scale.

The record shape

By default, each line of a Hookbase file is one event:

{"event_id":"evt_abc123","received_at":"2026-02-21T14:30:00Z","payload":{"type":"payment_intent.succeeded","data":{"amount":2500}}}

So the natural landing schema is three columns: event_id (string), received_at (timestamp), and payload (a semi-structured / JSON column).

Tip

If you'd rather land flat, typed columns instead of a nested payload, configure field mapping on the destination. Field mapping projects payload fields into top-level columns with explicit types (string, number, boolean, timestamp, json) — which makes external-table schemas in Athena and Redshift Spectrum far simpler.

Snowflake

Use a storage integration and an external stage, then COPY INTO a table with a VARIANT column.

-- 1. One-time: grant Snowflake access to the bucket (S3 shown; GCS/Azure analogous)
CREATE STORAGE INTEGRATION hookbase_s3
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/hookbase-snowflake'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-data-lake/webhooks/');
 
-- 2. File format + external stage
CREATE FILE FORMAT hookbase_jsonl TYPE = JSON STRIP_OUTER_ARRAY = FALSE;
 
CREATE STAGE hookbase_stage
  URL = 's3://my-data-lake/webhooks/'
  STORAGE_INTEGRATION = hookbase_s3
  FILE_FORMAT = hookbase_jsonl;
 
-- 3. Landing table
CREATE TABLE webhook_events (
  event_id    STRING,
  received_at TIMESTAMP_NTZ,
  payload     VARIANT
);
 
-- 4. Bulk load existing files
COPY INTO webhook_events
FROM (
  SELECT $1:event_id::string, $1:received_at::timestamp_ntz, $1:payload
  FROM @hookbase_stage
)
PATTERN = '.*[.]jsonl';

For continuous loading, point a Snowpipe at the same stage so new files load automatically:

CREATE PIPE hookbase_pipe AUTO_INGEST = TRUE AS
COPY INTO webhook_events
FROM (
  SELECT $1:event_id::string, $1:received_at::timestamp_ntz, $1:payload
  FROM @hookbase_stage
);

BigQuery

Hookbase writes natively to GCS, which BigQuery loads directly. Use a LOAD DATA statement into a table with a native JSON column:

LOAD DATA INTO mydataset.webhook_events (
  event_id    STRING,
  received_at TIMESTAMP,
  payload     JSON
)
FROM FILES (
  format = 'JSON',                       -- newline-delimited JSON
  uris = ['gs://my-gcs-bucket/webhooks/*.jsonl']
);

To query in place without loading, define an external table over the prefix:

CREATE EXTERNAL TABLE mydataset.webhook_events_ext
OPTIONS (
  format = 'NEWLINE_DELIMITED_JSON',
  uris = ['gs://my-gcs-bucket/webhooks/*']
);

Info

To read S3 (or R2) files from BigQuery instead of GCS, use BigLake / BigQuery Omni with an external connection. If you're on GCP, sending Hookbase events to a GCS warehouse destination is the simplest path.

Databricks

Create a Delta table and COPY INTO it from the bucket. Databricks infers the JSON structure, so payload lands as a nested column you can query with the : operator.

CREATE TABLE IF NOT EXISTS webhook_events;
 
COPY INTO webhook_events
FROM 's3://my-data-lake/webhooks/'
FILEFORMAT = JSON
COPY_OPTIONS ('mergeSchema' = 'true');

For a continuously updating table, use Auto Loader, which incrementally picks up new files as Hookbase writes them:

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/default/hookbase/_schema")
    .load("s3://my-data-lake/webhooks/")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/default/hookbase/_checkpoint")
    .toTable("webhook_events"))

Amazon Athena & Redshift Spectrum

Both query files directly in S3 with no load step. Nested JSON works best when you flatten it first, so this is where field mapping shines — map the fields you query into typed top-level columns, then declare them in the external table:

CREATE EXTERNAL TABLE webhook_events (
  event_id    string,
  received_at timestamp,
  event_type  string,
  amount      bigint
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION 's3://my-data-lake/webhooks/';

Then query as usual:

SELECT event_type, COUNT(*), SUM(amount)
FROM webhook_events
WHERE received_at > current_timestamp - interval '7' day
GROUP BY event_type;

Querying the raw payload

If you kept the nested payload column instead of flattening it, each engine extracts fields with its own JSON syntax:

-- Snowflake
SELECT payload:type::string AS event_type,
       payload:data.amount::number AS amount
FROM webhook_events;
 
-- BigQuery
SELECT JSON_VALUE(payload, '$.type') AS event_type,
       JSON_VALUE(payload, '$.data.amount') AS amount
FROM mydataset.webhook_events;
 
-- Databricks
SELECT payload:type AS event_type,
       payload:data.amount AS amount
FROM webhook_events;

Deduplicate on event_id

Hookbase retries failed uploads and batches can occasionally overlap, so treat event_id as the idempotency key when building downstream tables:

SELECT *
FROM webhook_events
QUALIFY ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY received_at DESC) = 1;

Tip

Match your destination's partition strategy (date or hour) to how you'll query. Date/hour-partitioned prefixes let Athena, Spectrum, and external tables prune files by time range, which cuts scan cost dramatically on large volumes. JSONL is recommended over a single JSON array — every engine streams newline-delimited records efficiently.

Related

  • Warehouse Destinations — configure S3, R2, GCS, and Azure Blob
  • Field Mapping — project payload fields into typed columns
  • Transforms — reshape payloads before they land in the bucket
  • Filters — only warehouse the events you care about
PreviousSchemasNextDeduplication

On this page

The patternThe record shapeSnowflakeBigQueryDatabricksAmazon Athena & Redshift SpectrumQuerying the raw payloadDeduplicate on event_idRelated