Impala的count(distinct QUESTION_ID) 与ndv(QUESTION_ID)
在impala中,一个select执行多个count(distinct col)会报错,举例:
select C_DEPT2, count(distinct QUESTION_BUSI_ID) as wo_num, count(distinct CREATOR_ID) as creator_num from pdm.kudu_q_basic where substr(CREATE_DATE, 1, 7) = '2020-10' group by C_DEPT2
报错信息:
ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT QUESTION_BUSI_ID); deviating function: count(DISTINCT CREATOR_ID)
Consider using NDV() instead of COUNT(DISTINCT) if estimated counts are acceptable. Enable the APPX_COUNT_DISTINCT query option to perform this rewrite automatically.
这时候,可通过以下方法解决:
1、得到的是近似值,数据量越大越不准确:
(1)SQL运行前,先运行命令:set APPX_COUNT_DISTINCT=true;
set APPX_COUNT_DISTINCT=true; select C_DEPT2, count(distinct QUESTION_BUSI_ID) as wo_num, count(distinct CREATOR_ID) as creator_num from pdm.kudu_q_basic where substr(CREATE_DATE, 1, 7) = '2020-10' group by C_DEPT2 order by C_DEPT2
(2)将count(distinct col)用函数ndv(col)代替
select C_DEPT2, ndv(QUESTION_BUSI_ID) as wo_num, ndv(CREATOR_ID) as creator_num from pdm.kudu_q_basic where substr(CREATE_DATE, 1, 7) = '2020-10' group by C_DEPT2 order by C_DEPT2
需要注意的是,在set APPX_COUNT_DISTINCT=true;的情况下,使用count(distinct col)会自动转化成ndv(col),得到的是近似值,所以以上两种方法的结果数据一致。
2、精确值。拆分为子查询,再关联,如下:
set APPX_COUNT_DISTINCT = false; -- 将参数置为false,使用count(distinct col),确保不会转化成ndv(col)
select a.C_DEPT2, a.wo_num, b.creator_num from (select C_DEPT2, count(distinct QUESTION_BUSI_ID) as wo_num from pdm.kudu_q_basic where substr(CREATE_DATE, 1, 7) = '2020-10' group by C_DEPT2) a left join (select C_DEPT2, count(distinct CREATOR_ID) as creator_num from pdm.kudu_q_basic where substr(CREATE_DATE, 1, 7) = '2020-10' group by C_DEPT2) b on a.C_DEPT2 = b.C_DEPT2 order by a.C_DEPT2
验证:
select C_DEPT2, count(*) from pdm.kudu_q_basic -- 表中无重复数据 where substr(CREATE_DATE, 1, 7) = '2020-10' group by C_DEPT2 order by C_DEPT2
总结:解决在impala中一个select执行多个count(distinct col)报错问题,可以用过设置参数set APPX_COUNT_DISTINCT = true;或将count(distinct col)用ndv(col)解决,但得到的是近似值,不准确。还可以通过分别在子查询中进行count(distinct col)再关联得到准确值,但要注意参数 APPX_COUNT_DISTINCT = false,不然会自动转化为ndv(col)得到的还是近似值。