yahoo / maha

A framework for rapid reporting API development; with out of the box support for high cardinality dimension lookups with druid.
Apache License 2.0
129 stars 58 forks source link

Fix: OGB query should not render unused primitive cols at pre-outer layer #1049

Closed ZhaoqiWang0605 closed 1 year ago

ZhaoqiWang0605 commented 1 year ago

I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.

Motivation

The OGB query (Hive, Presto, Bigquery) cannot get an aggregated result if a derived NoopRollup metrics is queried. Because the NoopRollup column is rendered at the pre-outer layer, where the primitive columns which has rendered at the inner most layer will be rendered again at the pre-outer layer.

Solution

Before rendering primitive columns at the pre-outer layer, remove the primitive columns who are only used by NoopRollup columns who will be rendered at the pre-outer layer as well

Example:

Here is a derived NoopRollup metrics:

HiveDerFactCol("derived_impressions", IntType(), SUM(DECODE("{device_id}", "1", "{impressions}", "0")), rollupExpression = NoopRollup)

Before the fix, the generated OGB query will be like the following query. Can find CASE WHEN (device_id IN (11)) THEN 'Desktop' WHEN (device_id IN (22)) THEN 'Tablet' WHEN (device_id IN (33)) THEN 'SmartPhone' WHEN (device_id IN (-1)) THEN 'UNKNOWN' ELSE 'UNKNOWN' END device_id was rendered at the pre-outer layer, even though only derived_impressions who is rendered at the same layer uses it

SELECT CONCAT_WS(',', CAST(NVL(mang_advertiser_name,'') AS STRING),CAST(NVL(mang_impressions,'') AS STRING),CAST(NVL(mang_derived_impressions,'') AS STRING))
FROM(
SELECT mang_advertiser_name AS mang_advertiser_name, impressions AS mang_impressions, derived_impressions AS mang_derived_impressions
FROM(
SELECT COALESCE(a1.mang_advertiser_name, 'NA') mang_advertiser_name, SUM(impressions) AS impressions, CASE WHEN (device_id IN (11)) THEN 'Desktop' WHEN (device_id IN (22)) THEN 'Tablet' WHEN (device_id IN (33)) THEN 'SmartPhone' WHEN (device_id IN (-1)) THEN 'UNKNOWN' ELSE 'UNKNOWN' END device_id, (SUM(decodeUDF(device_id, 1, impressions, 0))) AS derived_impressions
FROM(SELECT account_id, SUM(impressions) impressions, device_id
FROM s_stats_fact_underlying
WHERE (account_id = 12345) AND (stats_date >= '2023-07-18' AND stats_date <= '2023-07-25')
GROUP BY account_id, device_id

       )
ssfu0
LEFT OUTER JOIN (
SELECT name AS mang_advertiser_name, id a1_id
FROM advertiser_hive
WHERE ((load_time = '%DEFAULT_DIM_PARTITION_PREDICTATE%' ) AND (shard = 'all' )) AND (id = 12345)
)
a1
ON
ssfu0.account_id = a1.a1_id

GROUP BY COALESCE(a1.mang_advertiser_name, 'NA'), CASE WHEN (device_id IN (11)) THEN 'Desktop' WHEN (device_id IN (22)) THEN 'Tablet' WHEN (device_id IN (33)) THEN 'SmartPhone' WHEN (device_id IN (-1)) THEN 'UNKNOWN' ELSE 'UNKNOWN' END
) OgbQueryAlias
) queryAlias LIMIT 200

After the fix, the generated OGB query will be:

SELECT CONCAT_WS(',', CAST(NVL(mang_advertiser_name,'') AS STRING),CAST(NVL(mang_impressions,'') AS STRING),CAST(NVL(mang_derived_impressions,'') AS STRING))
FROM(
SELECT mang_advertiser_name AS mang_advertiser_name, impressions AS mang_impressions, derived_impressions AS mang_derived_impressions
FROM(
SELECT COALESCE(a1.mang_advertiser_name, 'NA') mang_advertiser_name, SUM(impressions) AS impressions, (SUM(decodeUDF(device_id, 1, impressions, 0))) AS derived_impressions
FROM(SELECT account_id, SUM(impressions) impressions, device_id
FROM s_stats_fact_underlying
WHERE (account_id = 12345) AND (stats_date >= '2023-07-18' AND stats_date <= '2023-07-25')
GROUP BY account_id, device_id

       )
ssfu0
LEFT OUTER JOIN (
SELECT name AS mang_advertiser_name, id a1_id
FROM advertiser_hive
WHERE ((load_time = '%DEFAULT_DIM_PARTITION_PREDICTATE%' ) AND (shard = 'all' )) AND (id = 12345)
)
a1
ON
ssfu0.account_id = a1.a1_id

GROUP BY COALESCE(a1.mang_advertiser_name, 'NA')
) OgbQueryAlias
) queryAlias LIMIT 200