yahoo / maha

A framework for rapid reporting API development; with out of the box support for high cardinality dimension lookups with druid.
Apache License 2.0
129 stars 58 forks source link

Allow Binary StrType columns, restrict filtering #1043

Closed ryankwagner closed 1 year ago

ryankwagner commented 1 year ago

Issue:

Resolution:

Below queries show behavior change in PrestoV1/HiveV2. This will make for simpler checking of when this PR is meant to "do"

Example existing query, using start_time as the Fact decoded column:

SELECT CAST(advertiser_id as VARCHAR) AS advertiser_id, CAST(mang_decoded_binary_col2 as VARCHAR) AS mang_decoded_binary_col2, CAST(mang_campaign_name as VARCHAR) AS mang_campaign_name, CAST(mang_impressions as VARCHAR) AS mang_impressions
FROM(
SELECT advertiser_id AS advertiser_id, decodeUDF(ad_group_id, 1, start_time, null) AS mang_decoded_binary_col2, mang_campaign_name AS mang_campaign_name, impressions AS mang_impressions
FROM(
SELECT COALESCE(CAST(account_id as BIGINT), 0) advertiser_id, getCsvEscapedString(CAST(COALESCE(c1.mang_campaign_name, '') AS VARCHAR)) mang_campaign_name, SUM(impressions) AS impressions, COALESCE(CAST(start_time as BIGINT), 0) start_time
FROM(SELECT account_id, campaign_id, SUM(impressions) impressions, start_time
FROM s_stats_fact_underlying
WHERE (account_id = 12345) AND (stats_date >= '2023-06-14' AND stats_date <= '2023-06-21')
GROUP BY account_id, campaign_id, start_time

       )
ssfu0
LEFT OUTER JOIN (
SELECT campaign_name AS mang_campaign_name, id c1_id
FROM campaign_presto_underlying
WHERE ((load_time = '%DEFAULT_DIM_PARTITION_PREDICTATE%' ) AND (shard = 'all' )) AND (advertiser_id = 12345)
)
c1
ON
CAST(ssfu0.campaign_id AS VARCHAR) = CAST(c1.c1_id AS VARCHAR)

GROUP BY COALESCE(CAST(account_id as BIGINT), 0), getCsvEscapedString(CAST(COALESCE(c1.mang_campaign_name, '') AS VARCHAR)), COALESCE(CAST(start_time as BIGINT), 0)
ORDER BY impressions DESC) OgbQueryAlias
)
        queryAlias LIMIT 200

Example new query, now decoding on the OGB Binary column:

SELECT CAST(advertiser_id as VARCHAR) AS advertiser_id, CAST(mang_decoded_binary_col as VARCHAR) AS mang_decoded_binary_col, CAST(mang_campaign_name as VARCHAR) AS mang_campaign_name, CAST(mang_impressions as VARCHAR) AS mang_impressions
FROM(
SELECT advertiser_id AS advertiser_id, decodeUDF(ad_group_id, 1, binarycol, null) AS mang_decoded_binary_col, mang_campaign_name AS mang_campaign_name, impressions AS mang_impressions
FROM(
SELECT COALESCE(CAST(account_id as BIGINT), 0) advertiser_id, getCsvEscapedString(CAST(COALESCE(c1.mang_campaign_name, '') AS VARCHAR)) mang_campaign_name, SUM(impressions) AS impressions, CAST(binarycol as BINARY) binarycol
FROM(SELECT account_id, campaign_id, SUM(impressions) impressions, binarycol
FROM s_stats_fact_underlying
WHERE (account_id = 12345) AND (stats_date >= '2023-06-14' AND stats_date <= '2023-06-21')
GROUP BY account_id, campaign_id, binarycol

       )
ssfu0
LEFT OUTER JOIN (
SELECT campaign_name AS mang_campaign_name, id c1_id
FROM campaign_presto_underlying
WHERE ((load_time = '%DEFAULT_DIM_PARTITION_PREDICTATE%' ) AND (shard = 'all' )) AND (advertiser_id = 12345)
)
c1
ON
CAST(ssfu0.campaign_id AS VARCHAR) = CAST(c1.c1_id AS VARCHAR)

GROUP BY COALESCE(CAST(account_id as BIGINT), 0), getCsvEscapedString(CAST(COALESCE(c1.mang_campaign_name, '') AS VARCHAR)), CAST(binarycol as BINARY)
ORDER BY impressions DESC) OgbQueryAlias
)
        queryAlias LIMIT 200

I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.