Hue: SparkSql Configuration and Usage
1. Environment:
HDP 2.4 V3 sandbox
hue 4.0.0
2. Building and Installing Hue 4.0.0
Source: https://github.com/cloudera/hue/releases/tag/release-4.1.0 (the release tag appears to be a publishing mistake: the link says 4.1.0, but the contents are version 4.0.0)
2.1 Update the version properties in %HUE_CODE_HOME%/hue/maven/pom.xml as follows:
<hadoop-mr1.version>2.7.1</hadoop-mr1.version>
<hadoop.version>2.7.1</hadoop.version>
<spark.version>1.6.0</spark.version>
2.2 Change hadoop-core to hadoop-common (the build fails with hadoop-core because the artifact cannot be found):
<artifactId>hadoop-common</artifactId>
2.3 Change the hadoop-test version to 1.2.1:
<artifactId>hadoop-test</artifactId>
<version>1.2.1</version>
2.4 Delete the redundant files, otherwise the build will fail
Remove both ThriftJobTrackerPlugin.java files, located at the following two paths (a removal sketch follows):
%HUE_CODE_HOME%/hue/desktop/libs/hadoop/java/src/main/java/org/apache/hadoop/thriftfs/ThriftJobTrackerPlugin.java
%HUE_CODE_HOME%/hue/desktop/libs/hadoop/java/src/main/java/org/apache/hadoop/mapred/ThriftJobTrackerPlugin.java
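For reference, assuming %HUE_CODE_HOME% is exported as the shell variable $HUE_CODE_HOME, both files can be removed with:
rm $HUE_CODE_HOME/hue/desktop/libs/hadoop/java/src/main/java/org/apache/hadoop/thriftfs/ThriftJobTrackerPlugin.java
rm $HUE_CODE_HOME/hue/desktop/libs/hadoop/java/src/main/java/org/apache/hadoop/mapred/ThriftJobTrackerPlugin.java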
2.5 Build and install
PREFIX=/usr/local/hue-4.0.0-release/ make clean   # PREFIX specifies the install directory
rm -rf /usr/local/hue-4.0.0-release/*
PREFIX=/usr/local/hue-4.0.0-release/ make install
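As a quick sanity check on the install (assuming the standard layout of a PREFIX install, which step 5.2 below also relies on), verify that the supervisor script exists:
ls /usr/local/hue-4.0.0-release/hue/build/env/bin/supervisor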
3. Configuring and Starting the Spark Thrift Server
On HDP 2.4 V3 the Spark Thrift Server's default port is 10015; we put this setting into /usr/hdp/current/spark-thriftserver/conf/hive-site.xml as follows. (I could not find an entry in Ambari for starting the Spark Thrift Server, so it has to be started by hand.)
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://sandbox.hortonworks.com:9083</value>
  </property>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>10015</value>
    <description>
      Port number of HiveServer2 Thrift interface.
      Can be overridden by setting $HIVE_SERVER2_THRIFT_PORT
    </description>
  </property>
  <!--
  <property>
    <name>hive.server2.thrift.bind.host</name>
    <value>localhost</value>
    <description>
      Bind host on which to run the HiveServer2 Thrift interface.
      Can be overridden by setting $HIVE_SERVER2_THRIFT_BIND_HOST
    </description>
  </property>
  -->
</configuration>
Once that is configured, start the Thrift Server:
cd /usr/hdp/current/spark-thriftserver/
sbin/start-thriftserver.sh --master yarn --deploy-mode client
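Optionally, smoke-test the Thrift Server with beeline before involving Hue at all. The beeline path here is an assumption; use whichever beeline your HDP install provides:
bin/beeline -u jdbc:hive2://localhost:10015 -e "show databases;"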
4. Configuring hue.ini (/usr/local/hue-4.0.0-release/hue/desktop/conf/hue.ini)
4.1 Uncomment the sparksql entry under [[interpreters]], as follows:
[[interpreters]]
    # Define the name and how to connect and execute the language.
    [[[hive]]]
      # The name of the snippet.
      name=Hive
      # The backend connection to use to communicate with the server.
      interface=hiveserver2
    [[[impala]]]
      name=Impala
      interface=hiveserver2
    [[[sparksql]]]
      name=SparkSql
      interface=hiveserver2
    [[[spark]]]
      name=Scala
      interface=livy
    [[[pyspark]]]
      name=PySpark
      interface=livy
    [[[r]]]
      name=R
      interface=livy
    [[[jar]]]
      name=Spark Submit Jar
      interface=livy-batch
4.2 Configure the Livy server settings for Spark as follows:
###########################################################################
# Settings to configure the Spark application.
###########################################################################
[spark]
# Host address of the Livy Server.
livy_server_host=localhost
# Port of the Livy Server.
livy_server_port=8998
# Configure Livy to start in local 'process' mode, or 'yarn' workers.
livy_server_session_kind=yarn
# Whether Livy requires client to perform Kerberos authentication.
security_enabled=false
# Host of the Sql Server
sql_server_host=localhost
# Port of the Sql Server
sql_server_port=10015
Note: sql_server_port must be set to the Spark Thrift Server port, 10015. A quick way to confirm the port is up is shown below.
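Before wiring Hue to it, it is worth confirming that the Thrift Server is actually listening on 10015 (netstat flags vary by distribution; this assumes a Linux netstat):
netstat -tlnp | grep 10015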
5. Verifying the Result
5.1 Make sure the Spark Thrift Server is running
cd /usr/hdp/current/spark-thriftserver/
sbin/start-thriftserver.sh --master yarn --deploy-mode client
5.2 Start Hue
cd /usr/local/hue-4.0.0-release/hue/
build/env/bin/supervisor
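Optionally, confirm the web UI is up (8888 is Hue's default http_port; adjust this if you changed it in hue.ini):
curl -I http://localhost:8888/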
5.3 Log in to Hue, open Notebook > Editor > SparkSql, and enter a SQL statement.
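For example, a simple query against one of the demo tables (sample_07 is assumed here; it ships with the HDP sandbox, but any table you have will do):
select * from sample_07 limit 10;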
5.4 Open the YARN UI; you should see one running Spark Thrift Server job.
5.5 Run the SQL from 5.3, then click the ApplicationMaster link to the right of the job from 5.4 to open the Spark UI, where the resulting Spark job is visible. On the Stages page you can see the SQL being executed.
5.6 Once execution completes, go back to the Hue page to see the returned data.
This shows that the request issued from Hue was received by the Spark Thrift Server and executed successfully.
6. Additional Notes:
Careful readers may have noticed that we configured a Livy server but never started one.
To clarify: when executing Spark SQL (via the sparksql entry under [[interpreters]]), Hue does not use the Livy server at all. The SQL is submitted directly to the Spark Thrift Server; Hue only reads the sql_server_host and sql_server_port settings from the [spark] section to locate it.
The livy-server is only needed when using interpreters such as Scala, PySpark, and R.