一.数据

  这是一个Udacity纳米学位项目(数据科学的顶点)。这个项目使用来自Sparkify的用户事件数据来建立一个模型来预测用户的流失。Sparkify是一个类似于Spotify或Pandora的数字化音乐服务(类似于网易云音乐和QQ音乐的音乐平台)。使用Spark分析探索某数字音乐服务平台Sparkify(类似于网易云音乐和QQ音乐的音乐平台)2016年10月1日-2016年12月1日期间用户在该平台上的行为数据。通过对用户行为和用户信息的分析,提取可能对预测用户是否流失有帮助的相关特征,从而建立流失用户预测模型。

以下是APP的日志数据,包含226个不同用户的信息,从为一首歌的点赞到更改账户设置,都有详细的操作。

数据集名称:mini_sparkify_event_data.json,原始数据12G,这里使用子集123M,可自行网上搜索下载。
数据描述:

artist    音乐信息    歌手名称
auth    网页信息    用户进入平台的方式
firstName    用户信息    用户的名
gender    用户信息    用户性别:F为女,M为男
itemInSession    网页信息    会话顺序
lastName    用户信息    用户的姓氏
length    音乐信息    音乐时长(秒)
level    Event    用户等级:free为免费用户,paid为付费用户
location    会话信息    用户在会话期间所属位置
method    网页信息    HTTP method ,GET 或者 PUT
page    网页信息    用户行为类型
registration    用户信息    用户注册时间
sessionId    会话信息    会话编号
song    音乐信息    歌曲名称
status    网页信息    HTTP状态编码. 2xx=Successful, 3xx=Redirection, 4xx=Client Error.
ts    网页信息    用户行为发生的时间
userAgent    会话信息    网络环境,所属浏览器
userId    用户信息    用户编码,具有唯一性

 

查看属性:

         +----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
         |          artist|     auth|firstName|gender|itemInSession|lastName|   length|level|            location|method|    page| registration|sessionId|                song|status|           ts|           userAgent|userId|
         +----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
         |  Martha Tilston|Logged In|    Colin|     M|           50| Freeman|277.89016| paid|     Bakersfield, CA|   PUT|NextSong|1538173362000|       29|           Rockpools|   200|1538352117000|Mozilla/5.0 (Wind...|    30|
         |Five Iron Frenzy|Logged In|    Micah|     M|           79|    Long|236.09424| free|Boston-Cambridge-...|   PUT|NextSong|1538331630000|        8|              Canada|   200|1538352180000|"Mozilla/5.0 (Win...|     9|
         |    Adam Lambert|Logged In|    Colin|     M|           51| Freeman| 282.8273| paid|     Bakersfield, CA|   PUT|NextSong|1538173362000|       29|   Time For Miracles|   200|1538352394000|Mozilla/5.0 (Wind...|    30|
         |          Enigma|Logged In|    Micah|     M|           80|    Long|262.71302| free|Boston-Cambridge-...|   PUT|NextSong|1538331630000|        8|Knocking On Forbi...|   200|1538352416000|"Mozilla/5.0 (Win...|     9|
         |       Daft Punk|Logged In|    Colin|     M|           52| Freeman|223.60771| paid|     Bakersfield, CA|   PUT|NextSong|1538173362000|       29|Harder Better Fas...|   200|1538352676000|Mozilla/5.0 (Wind...|    30|
         +----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
         

 

二.代码

详细代码见:RF用户流失预测(https://github.com/jiangnanboy/spark_tutorial)

public static void churnAnalysis(SparkSession session) {
        /**
         * 加载数据
         * 这是日志数据,包含226个不同用户的信息,从为一首歌点赞到更改账户设置,都有详细的操作。
         *
         * 数据描述
         * artist: Artist name (ex. Daft Punk)
         * auth: User authentication status (ex. Logged)
         * firstName: User first name (ex. Colin)
         * gender: Gender (ex. F or M)
         * itemInSession: Item count in a session (ex. 52)
         * lastName: User last name (ex. Freeman)
         * length: Length of song (ex. 223.60771)
         * level: User plan (ex. paid)
         * location: User\'s location (ex. Bakersfield)
         * method: HTTP method (ex. PUT)
         * page: Page name (ex. NextSong)
         * registration: Registration timestamp (unix timestamp) (ex. 1538173362000)
         * sessionId: Session ID (ex. 29)
         * song: Song (ex. Harder Better Faster Stronger)
         * status: HTTP status (ex. 200)
         * ts: Event timestamp(unix timestamp) (ex. 1538352676000)
         * userAgent: User\'s browswer agent (ex. Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0)
         * userId: User ID (ex. 30)
         *
         * +----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
         * |          artist|     auth|firstName|gender|itemInSession|lastName|   length|level|            location|method|    page| registration|sessionId|                song|status|           ts|           userAgent|userId|
         * +----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
         * |  Martha Tilston|Logged In|    Colin|     M|           50| Freeman|277.89016| paid|     Bakersfield, CA|   PUT|NextSong|1538173362000|       29|           Rockpools|   200|1538352117000|Mozilla/5.0 (Wind...|    30|
         * |Five Iron Frenzy|Logged In|    Micah|     M|           79|    Long|236.09424| free|Boston-Cambridge-...|   PUT|NextSong|1538331630000|        8|              Canada|   200|1538352180000|"Mozilla/5.0 (Win...|     9|
         * |    Adam Lambert|Logged In|    Colin|     M|           51| Freeman| 282.8273| paid|     Bakersfield, CA|   PUT|NextSong|1538173362000|       29|   Time For Miracles|   200|1538352394000|Mozilla/5.0 (Wind...|    30|
         * |          Enigma|Logged In|    Micah|     M|           80|    Long|262.71302| free|Boston-Cambridge-...|   PUT|NextSong|1538331630000|        8|Knocking On Forbi...|   200|1538352416000|"Mozilla/5.0 (Win...|     9|
         * |       Daft Punk|Logged In|    Colin|     M|           52| Freeman|223.60771| paid|     Bakersfield, CA|   PUT|NextSong|1538173362000|       29|Harder Better Fas...|   200|1538352676000|Mozilla/5.0 (Wind...|    30|
         * +----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
         */
        String path = PropertiesReader.get("mini_sparkify_event_data"); //数据集自行下载
        Dataset<Row> dataset = session.read().json(path);
        dataset.persist(StorageLevel.MEMORY_AND_DISK());
        dataset.show(5);
        /**
         *
         */

        //创建视图
        dataset.createOrReplaceTempView("userlogs");

        String[] columnsName = dataset.columns();
        Column[] columns = new Column[columnsName.length];
        for(int index = 0;index < columnsName.length; index++) {
            columns[index] = functions.count(functions.when(functions.isnull(col(columnsName[index])), columnsName[index])).as(columnsName[index]);
        }
        /**
         * 查看每个列为null的数量
         *
         * +------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+-----+------+---+---------+------+
         * |artist|auth|firstName|gender|itemInSession|lastName|length|level|location|method|page|registration|sessionId| song|status| ts|userAgent|userId|
         * +------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+-----+------+---+---------+------+
         * | 58392|   0|     8346|  8346|            0|    8346| 58392|    0|    8346|     0|   0|        8346|        0|58392|     0|  0|     8346|     0|
         * +------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+-----+------+---+---------+------+
         */
        dataset.select(columns);

        /**
         * 列“firstName”中的缺失值
         * +------+----------+---------+------+-------------+--------+------+-----+--------+------+-----+------------+---------+----+------+-------------+---------+------+
         * |artist|      auth|firstName|gender|itemInSession|lastName|length|level|location|method| page|registration|sessionId|song|status|           ts|userAgent|userId|
         * +------+----------+---------+------+-------------+--------+------+-----+--------+------+-----+------------+---------+----+------+-------------+---------+------+
         * |  null|Logged Out|     null|  null|          100|    null|  null| free|    null|   GET| Home|        null|        8|null|   200|1538355745000|     null|      |
         * |  null|Logged Out|     null|  null|          101|    null|  null| free|    null|   GET| Help|        null|        8|null|   200|1538355807000|     null|      |
         * |  null|Logged Out|     null|  null|          102|    null|  null| free|    null|   GET| Home|        null|        8|null|   200|1538355841000|     null|      |
         * |  null|Logged Out|     null|  null|          103|    null|  null| free|    null|   PUT|Login|        null|        8|null|   307|1538355842000|     null|      |
         * |  null|Logged Out|     null|  null|            2|    null|  null| free|    null|   GET| Home|        null|      240|null|   200|1538356678000|     null|      |
         * |  null|Logged Out|     null|  null|            3|    null|  null| free|    null|   PUT|Login|        null|      240|null|   307|1538356679000|     null|      |
         * |  null|Logged Out|     null|  null|            0|    null|  null| free|    null|   PUT|Login|        null|      100|null|   307|1538358102000|     null|      |
         * |  null|Logged Out|     null|  null|            0|    null|  null| free|    null|   PUT|Login|        null|      241|null|   307|1538360117000|     null|      |
         * |  null|Logged Out|     null|  null|           14|    null|  null| free|    null|   GET| Home|        null|      187|null|   200|1538361527000|     null|      |
         * |  null|Logged Out|     null|  null|           15|    null|  null| free|    null|   PUT|Login|        null|      187|null|   307|1538361528000|     null|      |
         * +------+----------+---------+------+-------------+--------+------+-----+--------+------+-----+------------+---------+----+------+-------------+---------+------+
         */
        dataset.where(col("firstName").isNaN()).show(10);

        /**
         * 列"artist"中的缺失值
         * +------+----------+---------+------+-------------+--------+------+-----+--------------------+------+---------------+-------------+---------+----+------+-------------+--------------------+------+
         * |artist|      auth|firstName|gender|itemInSession|lastName|length|level|            location|method|           page| registration|sessionId|song|status|           ts|           userAgent|userId|
         * +------+----------+---------+------+-------------+--------+------+-----+--------------------+------+---------------+-------------+---------+----+------+-------------+--------------------+------+
         * |  null| Logged In|    Colin|     M|           54| Freeman|  null| paid|     Bakersfield, CA|   PUT|Add to Playlist|1538173362000|       29|null|   200|1538352905000|Mozilla/5.0 (Wind...|    30|
         * |  null| Logged In|    Micah|     M|           84|    Long|  null| free|Boston-Cambridge-...|   GET|    Roll Advert|1538331630000|        8|null|   200|1538353150000|"Mozilla/5.0 (Win...|     9|
         * |  null| Logged In|    Micah|     M|           86|    Long|  null| free|Boston-Cambridge-...|   PUT|      Thumbs Up|1538331630000|        8|null|   307|1538353376000|"Mozilla/5.0 (Win...|     9|
         * |  null| Logged In|    Alexi|     F|            4|  Warren|  null| paid|Spokane-Spokane V...|   GET|      Downgrade|1532482662000|       53|null|   200|1538354749000|Mozilla/5.0 (Wind...|    54|
         * |  null| Logged In|    Alexi|     F|            7|  Warren|  null| paid|Spokane-Spokane V...|   PUT|      Thumbs Up|1532482662000|       53|null|   307|1538355255000|Mozilla/5.0 (Wind...|    54|
         * |  null| Logged In|    Micah|     M|           95|    Long|  null| free|Boston-Cambridge-...|   PUT|    Thumbs Down|1538331630000|        8|null|   307|1538355306000|"Mozilla/5.0 (Win...|     9|
         * |  null| Logged In|    Micah|     M|           97|    Long|  null| free|Boston-Cambridge-...|   GET|           Home|1538331630000|        8|null|   200|1538355504000|"Mozilla/5.0 (Win...|     9|
         * |  null| Logged In|    Micah|     M|           99|    Long|  null| free|Boston-Cambridge-...|   PUT|         Logout|1538331630000|        8|null|   307|1538355687000|"Mozilla/5.0 (Win...|     9|
         * |  null| Logged In|  Ashlynn|     F|            9|Williams|  null| free|     Tallahassee, FL|   PUT|      Thumbs Up|1537365219000|      217|null|   307|1538355711000|"Mozilla/5.0 (Mac...|    74|
         * |  null|Logged Out|     null|  null|          100|    null|  null| free|                null|   GET|           Home|         null|        8|null|   200|1538355745000|                null|      |
         * +------+----------+---------+------+-------------+--------+------+-----+--------------------+------+---------------+-------------+---------+----+------+-------------+--------------------+------+
         */
        dataset.where(col("artist").isNaN()).show(10);

        /**
         * 过滤注销的用户
         */
        dataset = dataset.where(col("auth").notEqual("Logged Out"));

        /**通过以上分析可知:
         * 虽然在较高的级别上userId或sessionId列中没有丢失的值,但进一步查看firstName列中丢失的值,可以发现注销的用户拥有空(但不是null)的用户ID。排除这些用户。
         * 列artist中也有缺失的值,但这些值对应于与音乐无关的动作的日志(如“Add to Playlist”、“Roll Advert”等)。只要这些用户是登录的,我们就希望保留这个activity,因为他们可能是分类的重要行为标记。
         */


        /**
         *  |-- artist: string (nullable = true)
         *  |-- auth: string (nullable = true)
         *  |-- firstName: string (nullable = true)
         *  |-- gender: string (nullable = true)
         *  |-- itemInSession: long (nullable = true)
         *  |-- lastName: string (nullable = true)
         *  |-- length: double (nullable = true)
         *  |-- level: string (nullable = true)
         *  |-- location: string (nullable = true)
         *  |-- method: string (nullable = true)
         *  |-- page: string (nullable = true)
         *  |-- registration: long (nullable = true)
         *  |-- sessionId: long (nullable = true)
         *  |-- song: string (nullable = true)
         *  |-- status: long (nullable = true)
         *  |-- ts: long (nullable = true)
         *  |-- userAgent: string (nullable = true)
         *  |-- userId: string (nullable = true)
         */
        dataset.printSchema();

        /**
         * 对用户事件动作group,count
         *
         * +--------------------+------+
         * |                Page| count|
         * +--------------------+------+
         * |              Cancel|    52|
         * |    Submit Downgrade|    63|
         * |         Thumbs Down|  2546|
         * |                Home| 10118|
         * |           Downgrade|  2055|
         * |         Roll Advert|  3933|
         * |              Logout|  3226|
         * |       Save Settings|   310|
         * |Cancellation Conf...|    52|
         * |               About|   509|
         * | Submit Registration|     5|
         * |            Settings|  1514|
         * |            Register|    18|
         * |     Add to Playlist|  6526|
         * |          Add Friend|  4277|
         * |            NextSong|228108|
         * |           Thumbs Up| 12551|
         * |                Help|  1477|
         * |             Upgrade|   499|
         * |               Error|   253|
         * +--------------------+------+
         */
        dataset.groupBy("Page").count().show();

        /**
         * 不同的用户数量
         *+--------+
         * |nb_users|
         * +--------+
         * |     226|
         * +--------+
         */
        session.sql("select count(distinct userId) as nb_users from userlogs").show();

        /**
         * 创建数据集,包括用户id和标签(是否流失)
         *
         * 定义:列Page中值为"Cancellation Confirmation"为流失,其它非流失
         */

        Dataset<Row> churnDataset = session.sql("select distinct userId, 1 as churn from userlogs where Page=\'Cancellation Confirmation\'");
        Dataset<Row> noChurnDataset = session.sql("select distinct userId, 0 as churn from userlogs where userId not in (select distinct userId from userlogs where Page=\'Cancellation Confirmation\')");

        /**
         * union churnDataset  noChurnDataset,shuffling the rows
         */
        Dataset<Row> unionDataset = churnDataset.union(noChurnDataset);
        unionDataset.createOrReplaceTempView("churn");
        unionDataset = session.sql("select * from churn order by rand()");
        unionDataset.createOrReplaceTempView("churn");

        /**
         * churn    userId
         *  0         174
         *  1         52
         */
        unionDataset.groupBy(col("churn")).count().show();

        /**
         * 以上通过定义page为“Cancellation Confirmation”为可以流失用户。
         * 这种方法使我们能够在用户流失之前研究其行为,尝试建立预测模型并提取表明将来有流失风险的行为。
         * 从以上可知建立的数据集不平衡,在建立模型前可进行采样缓解这种不平衡带来的不准确问题。
         */
      ......

 

版权声明:本文为little-horse原创文章,遵循 CC 4.0 BY-SA 版权协议,转载请附上原文出处链接和本声明。
本文链接:https://www.cnblogs.com/little-horse/p/14018469.html