https://github.com/zq2599/blog_demos

Contents: a categorized index of all original articles with companion source code, covering Java, Docker, Kubernetes, DevOps, and more;

  1. Basic data types
  2. Complex data types
  3. Internal and external tables
  4. Partitioned tables
  5. Bucketing
  6. HiveQL basics
  7. Built-in functions
  8. Sqoop
  9. Basic UDFs
  10. User-defined aggregate functions (UDAF)
  11. UDTF
  • This is the seventh article in the "Hive study notes" series. The previous article covered the common HiveQL statements; this one walks through the commonly used built-in functions, in the following categories:
  1. Math
  2. Character
  3. JSON processing
  4. Conversion
  5. Date
  6. Conditional
  7. Aggregate
  1. Two tables are needed for this walkthrough: a student table and an address table. Their fields are simple, as shown in the figure below; the student table has an address-ID field (addressid) that is the unique ID of a record in the address table:

(figure: schemas of the student and address tables)
2. First, create the address table:

```sql
create table address (addressid int, province string, city string)
row format delimited
fields terminated by ',';
```
3. Create a file named address.txt with the following content:

```
1,guangdong,guangzhou
2,guangdong,shenzhen
3,shanxi,xian
4,shanxi,hanzhong
6,jiangshu,nanjing
```
4. Load the data into the address table:

```sql
load data
local inpath '/home/hadoop/temp/202010/25/address.txt'
into table address;
```
5. Create the student table, whose addressid field references the addressid field of the address table:

```sql
create table student (name string, age int, addressid int)
row format delimited
fields terminated by ',';
```
6. Create a file named student.txt with the following content:

```
tom,11,1
jerry,12,2
mike,13,3
john,14,4
mary,15,5
```
7. Load the data into the student table:

```sql
load data
local inpath '/home/hadoop/temp/202010/25/student.txt'
into table student;
```
8. At this point all the data needed for this walkthrough is ready, as shown below:

```
hive> select * from address;
OK
1 guangdong guangzhou
2 guangdong shenzhen
3 shanxi xian
4 shanxi hanzhong
6 jiangshu nanjing
Time taken: 0.043 seconds, Fetched: 5 row(s)
hive> select * from student;
OK
tom 11 1
jerry 12 2
mike 13 3
john 14 4
mary 15 5
Time taken: 0.068 seconds, Fetched: 5 row(s)
```
  • Now let's try out the built-in functions;
  1. Enter the hive console;
  2. Run the command show functions; to list the built-in functions:
```
hive> show functions;
OK
!
!=
%
&
*
+
-
/
<
<=
<=>
<>
=
==
>
>=
^
abs
acos
add_months
and
array
array_contains
ascii
asin
assert_true
atan
avg
base64
between
bin
case
cbrt
ceil
ceiling
coalesce
collect_list
collect_set
compute_stats
concat
concat_ws
context_ngrams
conv
corr
cos
count
covar_pop
covar_samp
create_union
cume_dist
current_database
current_date
current_timestamp
current_user
date_add
date_format
date_sub
datediff
day
dayofmonth
decode
degrees
dense_rank
div
e
elt
encode
ewah_bitmap
ewah_bitmap_and
ewah_bitmap_empty
ewah_bitmap_or
exp
explode
factorial
field
find_in_set
first_value
floor
format_number
from_unixtime
from_utc_timestamp
get_json_object
greatest
hash
hex
histogram_numeric
hour
if
in
in_file
index
initcap
inline
instr
isnotnull
isnull
java_method
json_tuple
lag
last_day
last_value
lcase
lead
least
length
levenshtein
like
ln
locate
log
log10
log2
lower
lpad
ltrim
map
map_keys
map_values
matchpath
max
min
minute
month
months_between
named_struct
negative
next_day
ngrams
noop
noopstreaming
noopwithmap
noopwithmapstreaming
not
ntile
nvl
or
parse_url
parse_url_tuple
percent_rank
percentile
percentile_approx
pi
pmod
posexplode
positive
pow
power
printf
radians
rand
rank
reflect
reflect2
regexp
regexp_extract
regexp_replace
repeat
reverse
rlike
round
row_number
rpad
rtrim
second
sentences
shiftleft
shiftright
shiftrightunsigned
sign
sin
size
sort_array
soundex
space
split
sqrt
stack
std
stddev
stddev_pop
stddev_samp
str_to_map
struct
substr
substring
sum
tan
to_date
to_unix_timestamp
to_utc_timestamp
translate
trim
trunc
ucase
unbase64
unhex
unix_timestamp
upper
var_pop
var_samp
variance
weekofyear
when
windowingtablefunction
xpath
xpath_boolean
xpath_double
xpath_float
xpath_int
xpath_long
xpath_number
xpath_short
xpath_string
year
|
~
Time taken: 0.003 seconds, Fetched: 216 row(s)
```
3. Take the lower function as an example: run describe function lower; to see a description of the function:

```
hive> describe function lower;
OK
lower(str) - Returns str with all characters changed to lowercase
Time taken: 0.005 seconds, Fetched: 1 row(s)
```
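For more detail, Hive also supports the extended form of this command, which in most versions adds a usage example and the implementing class to the output (the exact output varies by Hive version):

```sql
-- show the description plus a usage example and the implementing class
describe function extended lower;
```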
  • Next, let's try the commonly used functions, starting with the math functions;
  • First run the following command so that query results include the column names:

```sql
set hive.cli.print.header=true;
```
1. Addition (+)

```
hive> select name, age, age+1 as add_value from student;
OK
name age add_value
tom 11 12
jerry 12 13
mike 13 14
john 14 15
mary 15 16
Time taken: 0.098 seconds, Fetched: 5 row(s)
```
2. Subtraction (-), multiplication (*), and division (/) work just like addition, so they are not shown in detail here;
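Subtraction (-), multiplication (*), and division (/) can be sketched on the same table as follows (the column aliases are mine; note that / in Hive always returns a double):

```sql
-- subtraction, multiplication, and division follow the same pattern as addition
select name,
       age - 1 as sub_value,
       age * 2 as mul_value,
       age / 2 as div_value
from student;
```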
3. Rounding: round

```
hive> select round(1.1), round(1.6);
OK
_c0 _c1
1.0 2.0
Time taken: 0.028 seconds, Fetched: 1 row(s)
```
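round also accepts an optional second argument giving the number of decimal places to keep; a small sketch (the values are chosen purely for illustration):

```sql
-- round(x, d) rounds x to d decimal places
select round(1.456, 1), round(1.456, 2);
```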
4. Round up: ceil

```
hive> select ceil(1.1);
OK
_c0
2
Time taken: 0.024 seconds, Fetched: 1 row(s)
```
5. Round down: floor

```
hive> select floor(1.1);
OK
_c0
1
Time taken: 0.024 seconds, Fetched: 1 row(s)
```
6. Exponentiation: pow; for example, pow(2,3) is 2 to the power of 3, which equals 8:

```
hive> select pow(2,3);
OK
_c0
8.0
Time taken: 0.027 seconds, Fetched: 1 row(s)
```
7. Modulus: pmod

```
hive> select pmod(10,3);
OK
_c0
1
Time taken: 0.059 seconds, Fetched: 1 row(s)
```
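pmod differs from the % operator when the first argument is negative: % keeps the sign of the dividend, while pmod always returns a non-negative result for a positive divisor. A sketch:

```sql
-- % follows the dividend's sign; pmod is non-negative for a positive divisor
select -10 % 3, pmod(-10, 3);   -- -1 and 2
```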
8. Lowercase: lower; uppercase: upper

```
hive> select lower(name), upper(name) from student;
OK
_c0 _c1
tom TOM
jerry JERRY
mike MIKE
john JOHN
mary MARY
Time taken: 0.051 seconds, Fetched: 5 row(s)
```
9. String length: length

```
hive> select name, length(name) from student;
OK
tom 3
jerry 5
mike 4
john 4
mary 4
Time taken: 0.322 seconds, Fetched: 5 row(s)
```
10. String concatenation: concat

```
hive> select concat("prefix_", name) from student;
OK
prefix_tom
prefix_jerry
prefix_mike
prefix_john
prefix_mary
Time taken: 0.106 seconds, Fetched: 5 row(s)
```
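The function list above also includes concat_ws, which joins its arguments with a separator; a small sketch on the address table:

```sql
-- concat_ws(separator, str1, str2, ...) joins the strings with the given separator
select concat_ws('-', province, city) from address;
```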
11. Substring: substr; substr(xxx,2) returns everything from the second character to the end, and substr(xxx,2,3) returns three characters starting at the second:

```
hive> select substr("0123456",2);
OK
123456
Time taken: 0.067 seconds, Fetched: 1 row(s)
hive> select substr("0123456",2,3);
OK
123
Time taken: 0.08 seconds, Fetched: 1 row(s)
```
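substr also accepts a negative start position, counted back from the end of the string; a sketch:

```sql
-- a negative start position counts back from the end of the string
select substr("0123456", -2);   -- 56
```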
12. Strip leading and trailing spaces: trim

```
hive> select trim(" 123 ");
OK
123
Time taken: 0.065 seconds, Fetched: 1 row(s)
```

To use the JSON-processing functions, first prepare some data:

1. Create table t15, with a single field for holding strings:

```sql
create table t15(json_raw string)
row format delimited;
```
2. Create a file named t15.txt with the following content:

```
{"name":"tom","age":"10"}
{"name":"jerry","age":"11"}
```
3. Load the data into the t15 table:

```sql
load data
local inpath '/home/hadoop/temp/202010/25/t15.txt'
into table t15;
```
4. Use the get_json_object function to parse the json_raw field and extract the name and age properties:

```sql
select
get_json_object(json_raw, "$.name"),
get_json_object(json_raw, "$.age")
from t15;
```

The result:

```
hive> select
    > get_json_object(json_raw, "$.name"),
    > get_json_object(json_raw, "$.age")
    > from t15;
OK
tom 10
jerry 11
Time taken: 0.081 seconds, Fetched: 2 row(s)
```
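When several properties are needed from the same JSON string, the json_tuple function (also in the list above) parses the string once and returns multiple columns; it is used with lateral view. A sketch:

```sql
-- json_tuple parses json_raw once and emits both properties as columns
select j.name, j.age
from t15
lateral view json_tuple(json_raw, 'name', 'age') j as name, age;
```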
1. Get the current date: current_date

```
hive> select current_date();
OK
2020-11-02
Time taken: 0.052 seconds, Fetched: 1 row(s)
```
2. Get the current timestamp: current_timestamp

```
hive> select current_timestamp();
OK
2020-11-02 10:07:58.967
Time taken: 0.049 seconds, Fetched: 1 row(s)
```
3. Get the year (year), month (month), and day (day):

```
hive> select year(current_date()), month(current_date()), day(current_date());
OK
2020 11 2
Time taken: 0.054 seconds, Fetched: 1 row(s)
```
4. year, month, and day can also be used with current_timestamp:

```
hive> select year(current_timestamp()), month(current_timestamp()), day(current_timestamp());
OK
2020 11 2
Time taken: 0.042 seconds, Fetched: 1 row(s)
```
5. Extract the date portion: to_date

```
hive> select to_date(current_timestamp());
OK
2020-11-02
Time taken: 0.051 seconds, Fetched: 1 row(s)
```
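Other date functions from the list above follow the same pattern; for example, date_add and date_sub shift a date by a number of days, and datediff returns the number of days between two dates. A sketch (the dates are chosen purely for illustration):

```sql
-- shift a date by days, and count the days between two dates
select date_add('2020-11-02', 7),
       date_sub('2020-11-02', 7),
       datediff('2020-11-02', '2020-10-25');
```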
  • Conditional functions work much like a switch statement in Java; the syntax is case X when XX then XXX else XXXX end;
  • For example, the following checks the name field: if it equals tom it returns tom_case, if it equals jerry it returns jerry_case, and in all other cases it returns other_case:

```sql
select name,
case name when 'tom' then 'tom_case'
when 'jerry' then 'jerry_case'
else 'other_case'
end
from student;
```

The result:

```
hive> select name,
    > case name when 'tom' then 'tom_case'
    > when 'jerry' then 'jerry_case'
    > else 'other_case'
    > end
    > from student;
OK
tom tom_case
jerry jerry_case
mike other_case
john other_case
mary other_case
Time taken: 0.08 seconds, Fetched: 5 row(s)
```
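Besides matching a single value as above, case also has a searched form with a boolean condition in each when branch; a sketch on the same table (the category labels are mine):

```sql
-- searched case: each branch tests its own condition
select name,
case when age <= 12 then 'young'
     when age <= 14 then 'middle'
     else 'old'
end
from student;
```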
1. Row count: count

```sql
select count(*) from student;
```

This triggers a MapReduce job; the result:

```
Total MapReduce CPU Time Spent: 2 seconds 170 msec
OK
5
Time taken: 20.823 seconds, Fetched: 1 row(s)
```
2. Per-group sum: sum

```sql
select province, sum(1) from address group by province;
```

This triggers a MapReduce job; the result:

```
Total MapReduce CPU Time Spent: 1 seconds 870 msec
OK
guangdong 2
jiangshu 1
shanxi 2
Time taken: 19.524 seconds, Fetched: 3 row(s)
```
3. Per-group minimum (min), maximum (max), and average (avg):

```sql
select province, min(addressid), max(addressid), avg(addressid) from address group by province;
```

This triggers a MapReduce job; the result:

```
Total MapReduce CPU Time Spent: 1 seconds 650 msec
OK
guangdong 1 2 1.5
jiangshu 6 6 6.0
shanxi 3 4 3.5
Time taken: 20.106 seconds, Fetched: 3 row(s)
```
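Grouped aggregates like these can also be filtered with a having clause, which applies after group by; a sketch that keeps only the provinces with more than one address:

```sql
-- having filters groups after aggregation
select province, count(*)
from address
group by province
having count(*) > 1;
```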
  • That covers all of the commonly used Hive built-in functions; I hope it gives you a useful reference. The next article will cover a commonly used tool: Sqoop;
  1. Java series
  2. Spring series
  3. Docker series
  4. kubernetes series
  5. Database + middleware series
  6. DevOps series

Search WeChat for 「程序员欣宸」 — I'm Xinchen, and I look forward to exploring the Java world with you…

Copyright notice: this is an original article by bolingcavalry, released under the CC 4.0 BY-SA license; when reposting, please include a link to the original and this notice.
Original link: https://www.cnblogs.com/bolingcavalry/p/14975212.html