SQL Compatibility Notes
HarborSQL executes SQL with Apache DataFusion. Most scalar SQL functions follow DataFusion behavior, which can differ from Databricks SQL Warehouse for edge cases. This page records known differences and recommended workarounds.
HarborSQL Compatibility Rewrites
HarborSQL applies a small set of compatibility rewrites before handing SQL to DataFusion. These rewrites keep common Databricks SQL connector and benchmark queries working while leaving general SQL semantics to DataFusion.
- Unaliased
COUNT(*)projections are aliased ascount(1)by default. - Unaliased expression projections are assigned Databricks-style metadata names by default.
- Simple contains-style
LIKE '%literal%'predicates can be rewritten to DataFusioncontains(...). - Single-capture
REGEXP_REPLACE(..., '$1')shapes can be rewritten to a HarborSQL UDF. extract(minute FROM timestamp)is rewritten to a HarborSQL UDF for Databricks-compatible minute extraction.- Databricks
get(array, zero_based_index)is rewritten to DataFusionarray_element(...)with one-based index adjustment and negative-index null behavior.get(...).fieldis rewritten to DataFusion named-field bracket access.
See Delta Types Compatibility for the
decimal, binary, nested-result, and get(array, index) compatibility notes.
REGEXP_REPLACE and Embedded Line Breaks
A ClickBench Q29 validation mismatch isolated a difference in this expression:
REGEXP_REPLACE(Referer, '^https?://(?:www\.)?([^/]+)/.*$', '$1')
The pattern extracts a domain-like key from a URL. The compatibility gap is the unflagged dot in the suffix:
.*$
DataFusion uses Rust regex semantics: . does not match \n unless DOTALL mode
is enabled. For strings like this:
http://example.com/path<LF>more text
DataFusion does not match the full pattern, so REGEXP_REPLACE returns the
original multi-line string. Databricks SQL Warehouse grouped the affected
ClickBench table-column values under example.com, so aggregate COUNT(*)
values changed.
Workaround
Make DOTALL behavior explicit in the pattern:
REGEXP_REPLACE(Referer, '(?s)^https?://(?:www\.)?([^/]+)/.*$', '$1')
or use an explicit any-character class:
REGEXP_REPLACE(Referer, '^https?://(?:www\.)?([^/]+)/[\s\S]*$', '$1')
The (?s) form is preferred. It keeps the grouping-key extraction compatible
for values with embedded line breaks without changing other expressions such as
length(Referer).
Runtime configuration
There is no HarborSQL or DataFusion runtime configuration setting that turns
DOTALL on globally for REGEXP_REPLACE. DOTALL is a regex-pattern option, so
opt in by rewriting the query pattern when this behavior is desired.
HarborSQL intentionally does not rewrite user regexes automatically. Applying DOTALL broadly can change valid queries where newlines are meant to be boundaries.