spark之客户流失预测
一.数据
这是一个Udacity纳米学位项目(数据科学的顶点)。这个项目使用来自Sparkify的用户事件数据来建立一个模型来预测用户的流失。Sparkify是一个类似于Spotify或Pandora的数字化音乐服务(类似于网易云音乐和QQ音乐的音乐平台)。使用Spark分析探索某数字音乐服务平台Sparkify(类似于网易云音乐和QQ音乐的音乐平台)2016年10月1日-2016年12月1日期间用户在该平台上的行为数据。通过对用户行为和用户信息的分析,提取可能对预测用户是否流失有帮助的相关特征,从而建立流失用户预测模型。
以下是APP的日志数据,包含226个不同用户的信息,从为一首歌的点赞到更改账户设置,都有详细的操作。
数据集名称:mini_sparkify_event_data.json,原始数据12G,这里使用子集123M,可自行网上搜索下载。
数据描述:
artist 音乐信息 歌手名称 auth 网页信息 用户进入平台的方式 firstName 用户信息 用户的名 gender 用户信息 用户性别:F为女,M为男 itemInSession 网页信息 会话顺序 lastName 用户信息 用户的姓氏 length 音乐信息 音乐时长(秒) level Event 用户等级:free为免费用户,paid为付费用户 location 会话信息 用户在会话期间所属位置 method 网页信息 HTTP method ,GET 或者 PUT page 网页信息 用户行为类型 registration 用户信息 用户注册时间 sessionId 会话信息 会话编号 song 音乐信息 歌曲名称 status 网页信息 HTTP状态编码. 2xx=Successful, 3xx=Redirection, 4xx=Client Error. ts 网页信息 用户行为发生的时间 userAgent 会话信息 网络环境,所属浏览器 userId 用户信息 用户编码,具有唯一性
查看属性:
+----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+ | artist| auth|firstName|gender|itemInSession|lastName| length|level| location|method| page| registration|sessionId| song|status| ts| userAgent|userId| +----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+ | Martha Tilston|Logged In| Colin| M| 50| Freeman|277.89016| paid| Bakersfield, CA| PUT|NextSong|1538173362000| 29| Rockpools| 200|1538352117000|Mozilla/5.0 (Wind...| 30| |Five Iron Frenzy|Logged In| Micah| M| 79| Long|236.09424| free|Boston-Cambridge-...| PUT|NextSong|1538331630000| 8| Canada| 200|1538352180000|"Mozilla/5.0 (Win...| 9| | Adam Lambert|Logged In| Colin| M| 51| Freeman| 282.8273| paid| Bakersfield, CA| PUT|NextSong|1538173362000| 29| Time For Miracles| 200|1538352394000|Mozilla/5.0 (Wind...| 30| | Enigma|Logged In| Micah| M| 80| Long|262.71302| free|Boston-Cambridge-...| PUT|NextSong|1538331630000| 8|Knocking On Forbi...| 200|1538352416000|"Mozilla/5.0 (Win...| 9| | Daft Punk|Logged In| Colin| M| 52| Freeman|223.60771| paid| Bakersfield, CA| PUT|NextSong|1538173362000| 29|Harder Better Fas...| 200|1538352676000|Mozilla/5.0 (Wind...| 30| +----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
二.代码
详细代码见:RF用户流失预测(https://github.com/jiangnanboy/spark_tutorial)
public static void churnAnalysis(SparkSession session) { /** * 加载数据 * 这是日志数据,包含226个不同用户的信息,从为一首歌点赞到更改账户设置,都有详细的操作。 * * 数据描述 * artist: Artist name (ex. Daft Punk) * auth: User authentication status (ex. Logged) * firstName: User first name (ex. Colin) * gender: Gender (ex. F or M) * itemInSession: Item count in a session (ex. 52) * lastName: User last name (ex. Freeman) * length: Length of song (ex. 223.60771) * level: User plan (ex. paid) * location: User\'s location (ex. Bakersfield) * method: HTTP method (ex. PUT) * page: Page name (ex. NextSong) * registration: Registration timestamp (unix timestamp) (ex. 1538173362000) * sessionId: Session ID (ex. 29) * song: Song (ex. Harder Better Faster Stronger) * status: HTTP status (ex. 200) * ts: Event timestamp(unix timestamp) (ex. 1538352676000) * userAgent: User\'s browswer agent (ex. Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0) * userId: User ID (ex. 30) * * +----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+ * | artist| auth|firstName|gender|itemInSession|lastName| length|level| location|method| page| registration|sessionId| song|status| ts| userAgent|userId| * +----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+ * | Martha Tilston|Logged In| Colin| M| 50| Freeman|277.89016| paid| Bakersfield, CA| PUT|NextSong|1538173362000| 29| Rockpools| 200|1538352117000|Mozilla/5.0 (Wind...| 30| * |Five Iron Frenzy|Logged In| Micah| M| 79| Long|236.09424| free|Boston-Cambridge-...| PUT|NextSong|1538331630000| 8| Canada| 200|1538352180000|"Mozilla/5.0 (Win...| 9| * | Adam Lambert|Logged In| Colin| M| 51| Freeman| 282.8273| paid| Bakersfield, CA| PUT|NextSong|1538173362000| 29| Time For Miracles| 200|1538352394000|Mozilla/5.0 (Wind...| 30| * | Enigma|Logged In| Micah| M| 80| Long|262.71302| free|Boston-Cambridge-...| PUT|NextSong|1538331630000| 8|Knocking On Forbi...| 200|1538352416000|"Mozilla/5.0 (Win...| 9| * | Daft Punk|Logged In| Colin| M| 52| Freeman|223.60771| paid| Bakersfield, CA| PUT|NextSong|1538173362000| 29|Harder Better Fas...| 200|1538352676000|Mozilla/5.0 (Wind...| 30| * +----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+ */ String path = PropertiesReader.get("mini_sparkify_event_data"); //数据集自行下载 Dataset<Row> dataset = session.read().json(path); dataset.persist(StorageLevel.MEMORY_AND_DISK()); dataset.show(5); /** * */ //创建视图 dataset.createOrReplaceTempView("userlogs"); String[] columnsName = dataset.columns(); Column[] columns = new Column[columnsName.length]; for(int index = 0;index < columnsName.length; index++) { columns[index] = functions.count(functions.when(functions.isnull(col(columnsName[index])), columnsName[index])).as(columnsName[index]); } /** * 查看每个列为null的数量 * * +------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+-----+------+---+---------+------+ * |artist|auth|firstName|gender|itemInSession|lastName|length|level|location|method|page|registration|sessionId| song|status| ts|userAgent|userId| * +------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+-----+------+---+---------+------+ * | 58392| 0| 8346| 8346| 0| 8346| 58392| 0| 8346| 0| 0| 8346| 0|58392| 0| 0| 8346| 0| * +------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+-----+------+---+---------+------+ */ dataset.select(columns); /** * 列“firstName”中的缺失值 * +------+----------+---------+------+-------------+--------+------+-----+--------+------+-----+------------+---------+----+------+-------------+---------+------+ * |artist| auth|firstName|gender|itemInSession|lastName|length|level|location|method| page|registration|sessionId|song|status| ts|userAgent|userId| * +------+----------+---------+------+-------------+--------+------+-----+--------+------+-----+------------+---------+----+------+-------------+---------+------+ * | null|Logged Out| null| null| 100| null| null| free| null| GET| Home| null| 8|null| 200|1538355745000| null| | * | null|Logged Out| null| null| 101| null| null| free| null| GET| Help| null| 8|null| 200|1538355807000| null| | * | null|Logged Out| null| null| 102| null| null| free| null| GET| Home| null| 8|null| 200|1538355841000| null| | * | null|Logged Out| null| null| 103| null| null| free| null| PUT|Login| null| 8|null| 307|1538355842000| null| | * | null|Logged Out| null| null| 2| null| null| free| null| GET| Home| null| 240|null| 200|1538356678000| null| | * | null|Logged Out| null| null| 3| null| null| free| null| PUT|Login| null| 240|null| 307|1538356679000| null| | * | null|Logged Out| null| null| 0| null| null| free| null| PUT|Login| null| 100|null| 307|1538358102000| null| | * | null|Logged Out| null| null| 0| null| null| free| null| PUT|Login| null| 241|null| 307|1538360117000| null| | * | null|Logged Out| null| null| 14| null| null| free| null| GET| Home| null| 187|null| 200|1538361527000| null| | * | null|Logged Out| null| null| 15| null| null| free| null| PUT|Login| null| 187|null| 307|1538361528000| null| | * +------+----------+---------+------+-------------+--------+------+-----+--------+------+-----+------------+---------+----+------+-------------+---------+------+ */ dataset.where(col("firstName").isNaN()).show(10); /** * 列"artist"中的缺失值 * +------+----------+---------+------+-------------+--------+------+-----+--------------------+------+---------------+-------------+---------+----+------+-------------+--------------------+------+ * |artist| auth|firstName|gender|itemInSession|lastName|length|level| location|method| page| registration|sessionId|song|status| ts| userAgent|userId| * +------+----------+---------+------+-------------+--------+------+-----+--------------------+------+---------------+-------------+---------+----+------+-------------+--------------------+------+ * | null| Logged In| Colin| M| 54| Freeman| null| paid| Bakersfield, CA| PUT|Add to Playlist|1538173362000| 29|null| 200|1538352905000|Mozilla/5.0 (Wind...| 30| * | null| Logged In| Micah| M| 84| Long| null| free|Boston-Cambridge-...| GET| Roll Advert|1538331630000| 8|null| 200|1538353150000|"Mozilla/5.0 (Win...| 9| * | null| Logged In| Micah| M| 86| Long| null| free|Boston-Cambridge-...| PUT| Thumbs Up|1538331630000| 8|null| 307|1538353376000|"Mozilla/5.0 (Win...| 9| * | null| Logged In| Alexi| F| 4| Warren| null| paid|Spokane-Spokane V...| GET| Downgrade|1532482662000| 53|null| 200|1538354749000|Mozilla/5.0 (Wind...| 54| * | null| Logged In| Alexi| F| 7| Warren| null| paid|Spokane-Spokane V...| PUT| Thumbs Up|1532482662000| 53|null| 307|1538355255000|Mozilla/5.0 (Wind...| 54| * | null| Logged In| Micah| M| 95| Long| null| free|Boston-Cambridge-...| PUT| Thumbs Down|1538331630000| 8|null| 307|1538355306000|"Mozilla/5.0 (Win...| 9| * | null| Logged In| Micah| M| 97| Long| null| free|Boston-Cambridge-...| GET| Home|1538331630000| 8|null| 200|1538355504000|"Mozilla/5.0 (Win...| 9| * | null| Logged In| Micah| M| 99| Long| null| free|Boston-Cambridge-...| PUT| Logout|1538331630000| 8|null| 307|1538355687000|"Mozilla/5.0 (Win...| 9| * | null| Logged In| Ashlynn| F| 9|Williams| null| free| Tallahassee, FL| PUT| Thumbs Up|1537365219000| 217|null| 307|1538355711000|"Mozilla/5.0 (Mac...| 74| * | null|Logged Out| null| null| 100| null| null| free| null| GET| Home| null| 8|null| 200|1538355745000| null| | * +------+----------+---------+------+-------------+--------+------+-----+--------------------+------+---------------+-------------+---------+----+------+-------------+--------------------+------+ */ dataset.where(col("artist").isNaN()).show(10); /** * 过滤注销的用户 */ dataset = dataset.where(col("auth").notEqual("Logged Out")); /**通过以上分析可知: * 虽然在较高的级别上userId或sessionId列中没有丢失的值,但进一步查看firstName列中丢失的值,可以发现注销的用户拥有空(但不是null)的用户ID。排除这些用户。 * 列artist中也有缺失的值,但这些值对应于与音乐无关的动作的日志(如“Add to Playlist”、“Roll Advert”等)。只要这些用户是登录的,我们就希望保留这个activity,因为他们可能是分类的重要行为标记。 */ /** * |-- artist: string (nullable = true) * |-- auth: string (nullable = true) * |-- firstName: string (nullable = true) * |-- gender: string (nullable = true) * |-- itemInSession: long (nullable = true) * |-- lastName: string (nullable = true) * |-- length: double (nullable = true) * |-- level: string (nullable = true) * |-- location: string (nullable = true) * |-- method: string (nullable = true) * |-- page: string (nullable = true) * |-- registration: long (nullable = true) * |-- sessionId: long (nullable = true) * |-- song: string (nullable = true) * |-- status: long (nullable = true) * |-- ts: long (nullable = true) * |-- userAgent: string (nullable = true) * |-- userId: string (nullable = true) */ dataset.printSchema(); /** * 对用户事件动作group,count * * +--------------------+------+ * | Page| count| * +--------------------+------+ * | Cancel| 52| * | Submit Downgrade| 63| * | Thumbs Down| 2546| * | Home| 10118| * | Downgrade| 2055| * | Roll Advert| 3933| * | Logout| 3226| * | Save Settings| 310| * |Cancellation Conf...| 52| * | About| 509| * | Submit Registration| 5| * | Settings| 1514| * | Register| 18| * | Add to Playlist| 6526| * | Add Friend| 4277| * | NextSong|228108| * | Thumbs Up| 12551| * | Help| 1477| * | Upgrade| 499| * | Error| 253| * +--------------------+------+ */ dataset.groupBy("Page").count().show(); /** * 不同的用户数量 *+--------+ * |nb_users| * +--------+ * | 226| * +--------+ */ session.sql("select count(distinct userId) as nb_users from userlogs").show(); /** * 创建数据集,包括用户id和标签(是否流失) * * 定义:列Page中值为"Cancellation Confirmation"为流失,其它非流失 */ Dataset<Row> churnDataset = session.sql("select distinct userId, 1 as churn from userlogs where Page=\'Cancellation Confirmation\'"); Dataset<Row> noChurnDataset = session.sql("select distinct userId, 0 as churn from userlogs where userId not in (select distinct userId from userlogs where Page=\'Cancellation Confirmation\')"); /** * union churnDataset noChurnDataset,shuffling the rows */ Dataset<Row> unionDataset = churnDataset.union(noChurnDataset); unionDataset.createOrReplaceTempView("churn"); unionDataset = session.sql("select * from churn order by rand()"); unionDataset.createOrReplaceTempView("churn"); /** * churn userId * 0 174 * 1 52 */ unionDataset.groupBy(col("churn")).count().show(); /** * 以上通过定义page为“Cancellation Confirmation”为可以流失用户。 * 这种方法使我们能够在用户流失之前研究其行为,尝试建立预测模型并提取表明将来有流失风险的行为。 * 从以上可知建立的数据集不平衡,在建立模型前可进行采样缓解这种不平衡带来的不准确问题。 */
......