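The 2018 FIFA World Cup is about to kick off, and fans around the world are eager to know: who will lift the coveted World Cup trophy? If you are a tech person as well as a football fan, you surely know that machine learning and artificial intelligence are current buzzwords too. Let's combine the two to predict which country will win this year's FIFA World Cup in Russia.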

The original article is from CSDN (WeChat public account ID: CSDNnews); its structure has been slightly modified.


Author: Gerald Muriuki, economics and data science specialist

Translator: 弯月; editor: 郭芮

The complete code is available here: https://github.com/itsmuriuki/FIFA-2018-World-cup-predictions

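A football match involves far too many factors for me to fit them all into a machine learning model. This article is simply a hacker trying something cool with data. Its goals are:

  1. Use machine learning to predict who will win the 2018 FIFA World Cup;

  2. Predict the results of the entire group stage;

  3. Simulate the quarter-finals, semi-finals, and final.

These goals pose a distinctive real-world machine learning prediction problem and touch on several machine learning tasks: data integration, feature modelling, and outcome prediction.

I used two datasets from Kaggle. We will use the historical match results of all participating teams since the first World Cup in 1930.

The FIFA rankings were only created in the 1990s, so a large portion of the data is missing there; that is why we rely on historical match records instead.

First we do exploratory analysis on both datasets, then feature engineering to select the features most relevant to the prediction, followed by data processing. We then choose a machine learning model and finally deploy it on the dataset.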

First, import the required libraries and load the datasets into data frames:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import matplotlib.ticker as ticker
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

Load the datasets:

    # Load data (forward slashes avoid backslash-escape problems in Windows paths)
    world_cup = pd.read_csv('C:/Coding/FIFA2018-World-cup/datasets/World Cup 2018 Dataset.csv')
    results = pd.read_csv('C:/Coding/FIFA2018-World-cup/datasets/results.csv')

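The next step is to check the loaded data. Calling world_cup.head() and results.head() confirms that both datasets were loaded into data frames.

After analysing both datasets, the resulting dataset contains data on past matches; this new (resulting) dataset is very useful for analysing and predicting future matches.

Exploratory analysis and feature engineering: this means building the features relevant to the machine learning model, and it is the most time-consuming part of any data science project.

Now we add goal-difference and outcome columns to the results dataset: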

    # Adding goal difference and establishing who is the winner
    winner = []
    for i in range(len(results['home_team'])):
        if results['home_score'][i] > results['away_score'][i]:
            winner.append(results['home_team'][i])
        elif results['home_score'][i] < results['away_score'][i]:
            winner.append(results['away_team'][i])
        else:
            winner.append('Draw')
    results['winning_team'] = winner

    # Adding goal difference column
    results['goal_difference'] = np.absolute(results['home_score'] - results['away_score'])

    results.head()
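The row-by-row loop above can also be written with vectorised pandas operations. The following is a minimal sketch on a made-up three-row frame (the `toy_results` data is hypothetical, not taken from the Kaggle file); it is an alternative formulation, not the author's code.

```python
import pandas as pd

# Toy stand-in for the `results` frame; the rows are invented for illustration.
toy_results = pd.DataFrame({
    "home_team": ["Brazil", "Nigeria", "France"],
    "away_team": ["Germany", "Egypt", "Spain"],
    "home_score": [2, 0, 1],
    "away_score": [1, 3, 1],
})

home_won = toy_results["home_score"] > toy_results["away_score"]
drawn = toy_results["home_score"] == toy_results["away_score"]

# Keep the home team where it won, otherwise take the away team ...
winning_team = toy_results["home_team"].where(home_won, toy_results["away_team"])
# ... and overwrite level games with the label 'Draw'.
toy_results["winning_team"] = winning_team.mask(drawn, "Draw")

# Absolute goal difference, computed for the whole column at once.
toy_results["goal_difference"] = (
    toy_results["home_score"] - toy_results["away_score"]
).abs()

print(toy_results)
```

On a frame with tens of thousands of matches, this form avoids the Python-level loop and the repeated per-cell indexing.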

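The results.head() call lets us check the new results data frame.

Next we work with a subset of the data containing only the matches Nigeria played. This helps us focus on which features are useful for one country before extending the approach to all the countries taking part in the World Cup: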

    # Work with a subset of the data: the games played by Nigeria
    df = results[(results['home_team'] == 'Nigeria') | (results['away_team'] == 'Nigeria')]
    nigeria = df.copy()  # explicit copy, so adding a column later does not touch `results`
    nigeria.head()

The first World Cup was held in 1930. We create a column for the year and select all matches played from 1930 onwards:

    # Create a year column; the first World Cup was held in 1930
    year = []
    for row in nigeria['date']:
        year.append(int(row[:4]))
    nigeria['match_year'] = year
    nigeria_1930 = nigeria[nigeria.match_year >= 1930]
    nigeria_1930.count()
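Slicing the first four characters works as long as every date string starts with the year. As a hedged alternative, the year can be taken from parsed dates; the sketch below assumes ISO-style `YYYY-MM-DD` strings and uses a hypothetical `toy_matches` frame.

```python
import pandas as pd

# Hypothetical rows; only the `date` column matters here.
toy_matches = pd.DataFrame({
    "date": ["1929-11-30", "1930-07-13", "2018-06-14"],
    "home_team": ["A", "B", "C"],
})

# Parse the strings into timestamps and read the year from them.
toy_matches["match_year"] = pd.to_datetime(toy_matches["date"]).dt.year

# Keep only matches played from 1930 onwards.
since_1930 = toy_matches[toy_matches["match_year"] >= 1930]
print(since_1930)
```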

Now we can plot Nigeria's most common match outcomes over the years:

    # Visualise the most common game outcome for Nigeria
    wins = []
    for row in nigeria_1930['winning_team']:
        if row != 'Nigeria' and row != 'Draw':
            wins.append('Loss')
        else:
            wins.append(row)
    winsdf = pd.DataFrame(wins, columns=['Nigeria_Results'])

    # Plotting
    fig, ax = plt.subplots(1)
    fig.set_size_inches(10.7, 6.27)
    sns.set(style='darkgrid')
    sns.countplot(x='Nigeria_Results', data=winsdf)

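The win rate of each country taking part in the World Cup is a very helpful metric, and we can use it to predict the most likely outcome of each match in the tournament.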
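The post does not show how a win rate would be computed, so here is a minimal sketch under stated assumptions: a frame shaped like `results` with the `winning_team` column already added. The rows and the `win_rate` helper are hypothetical.

```python
import pandas as pd

# Invented matches, shaped like `results` after `winning_team` was added.
toy = pd.DataFrame({
    "home_team": ["Nigeria", "Brazil", "Nigeria", "Egypt"],
    "away_team": ["Egypt", "Nigeria", "Brazil", "Brazil"],
    "winning_team": ["Nigeria", "Brazil", "Draw", "Brazil"],
})


def win_rate(matches, team):
    """Share of a team's matches (home or away) that the team won."""
    played = matches[(matches["home_team"] == team) | (matches["away_team"] == team)]
    if played.empty:
        return float("nan")
    return float((played["winning_team"] == team).mean())


rates = {team: win_rate(toy, team) for team in ["Nigeria", "Brazil", "Egypt"]}
print(rates)
```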

We build a data frame containing the matches of every team taking part in the 2018 World Cup and drop duplicate rows; the matches from 1930 onwards are filtered out of this frame in the next step:

    # Narrowing to the teams participating in the World Cup
    worldcup_teams = ['Australia', 'Iran', 'Japan', 'Korea Republic',
                      'Saudi Arabia', 'Egypt', 'Morocco', 'Nigeria',
                      'Senegal', 'Tunisia', 'Costa Rica', 'Mexico',
                      'Panama', 'Argentina', 'Brazil', 'Colombia',
                      'Peru', 'Uruguay', 'Belgium', 'Croatia',
                      'Denmark', 'England', 'France', 'Germany',
                      'Iceland', 'Poland', 'Portugal', 'Russia',
                      'Serbia', 'Spain', 'Sweden', 'Switzerland']
    df_teams_home = results[results['home_team'].isin(worldcup_teams)]
    df_teams_away = results[results['away_team'].isin(worldcup_teams)]
    df_teams = pd.concat((df_teams_home, df_teams_away))
    # drop_duplicates() returns a new frame, so the result has to be assigned back
    df_teams = df_teams.drop_duplicates()
    df_teams.count()

Create a year column, drop the matches played before 1930, and drop the columns that will not affect the match outcome: date, home_score, away_score, tournament, city, country, goal_difference, and match_year:

    # Create a year column to drop games before 1930
    year = []
    for row in df_teams['date']:
        year.append(int(row[:4]))
    df_teams['match_year'] = year
    df_teams_1930 = df_teams[df_teams.match_year >= 1930]
    df_teams_1930.head()

    # Dropping columns that will not affect match outcomes
    # (drop from the filtered frame, so the 1930 cut-off is kept)
    df_teams_1930 = df_teams_1930.drop(['date', 'home_score', 'away_score', 'tournament', 'city', 'country', 'goal_difference', 'match_year'], axis=1)
    df_teams_1930.head()

To simplify the model's work, we recode the prediction label.

The winning_team column shows 2 if the home team won, 1 for a draw, and 0 if the away team won:

    # Building the model
    # The prediction label: the winning_team column will show "2" if the home team has won,
    # "1" if it was a tie, and "0" if the away team has won.

    df_teams_1930 = df_teams_1930.reset_index(drop=True)
    df_teams_1930.loc[df_teams_1930.winning_team == df_teams_1930.home_team, 'winning_team'] = 2
    df_teams_1930.loc[df_teams_1930.winning_team == 'Draw', 'winning_team'] = 1
    df_teams_1930.loc[df_teams_1930.winning_team == df_teams_1930.away_team, 'winning_team'] = 0

    df_teams_1930.head()

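By creating dummy variables, we convert home_team and away_team from categorical variables into numeric inputs.

The pandas get_dummies() function does this: it replaces each categorical column with one-hot columns (made up of 1s and 0s) so that they can be fed into a scikit-learn model.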
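To make the one-hot idea concrete, here is a small sketch on a hypothetical two-match frame. It uses the same `pd.get_dummies` arguments as the article's code; only the data is invented.

```python
import pandas as pd

# Two invented matches with an already-encoded label column.
toy = pd.DataFrame({
    "home_team": ["Brazil", "Spain"],
    "away_team": ["Spain", "Peru"],
    "winning_team": [2, 0],
})

# Each distinct team name becomes its own indicator column,
# prefixed with the name of the column it came from.
encoded = pd.get_dummies(
    toy,
    prefix=["home_team", "away_team"],
    columns=["home_team", "away_team"],
)
print(encoded)
```

Each row ends up with exactly one active `home_team_*` column and one active `away_team_*` column, while `winning_team` is left untouched.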

Next, we separate the data into X (features) and y (labels), and split it into 70% training data and 30% test data:

    # Convert home team and away team from categorical variables to numeric inputs
    # Get dummy variables
    final = pd.get_dummies(df_teams_1930, prefix=['home_team', 'away_team'], columns=['home_team', 'away_team'])

    # Separate X and y sets
    X = final.drop(['winning_team'], axis=1)
    y = final["winning_team"]
    y = y.astype('int')

    # Separate train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

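Here we use a classification algorithm: logistic regression. How does it work? It uses the logistic function, that is, the cumulative logistic distribution, to estimate probabilities, and thereby measures the relationship between a categorical dependent variable and one or more independent variables.

In other words, logistic regression tries to predict an outcome (win or lose) from a given set of data (statistics) that can influence that outcome.

In practice it works like this: using the two "datasets" described above and the actual match results, matches are fed to the algorithm one at a time. The model then learns whether each piece of input had a positive or negative effect on the result, and how strong that effect was.

Given enough (good) training data, we get a model that can predict future results; how good the model is depends on the data fed into it.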
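With three possible labels (away win, draw, home win), one common multiclass formulation turns a linear score per class into probabilities with the softmax function, which generalises the logistic function. The sketch below illustrates only that step with made-up scores. Whether scikit-learn uses this formulation or a one-vs-rest scheme depends on the library version and settings, so treat it as an illustration rather than a description of the fitted model.

```python
import numpy as np


def softmax(scores):
    """Turn one score per class into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtracting the max keeps exp() stable
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


# Hypothetical linear scores for the classes [0: away win, 1: draw, 2: home win].
scores = np.array([-0.5, 0.1, 0.9])
probs = softmax(scores)

# The predicted label is the class with the highest probability.
predicted = int(np.argmax(probs))
print(probs, predicted)
```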

We then pass the data to the algorithm:

    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)
    score = logreg.score(X_train, y_train)
    score2 = logreg.score(X_test, y_test)
    print("Training set accuracy: ", '%.3f' % (score))
    print("Test set accuracy: ", '%.3f' % (score2))

Output:

    Training set accuracy: 0.573
    Test set accuracy: 0.551

Our model is 57% accurate on the training set and 55% accurate on the test set. That is not great, but let's move on to the next step.
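One way to judge an accuracy figure like this is to compare it with a trivial baseline that always predicts the most frequent label. The sketch below shows the calculation on hypothetical labels; the article does not report the real label distribution, so no baseline figure for the actual data is implied.

```python
import pandas as pd

# Hypothetical labels: 2 = home win, 1 = draw, 0 = away win.
y_toy = pd.Series([2, 2, 2, 1, 0, 2, 0, 1, 2, 0])

# Accuracy of always predicting the most common label.
baseline_accuracy = float(y_toy.value_counts(normalize=True).max())
print(baseline_accuracy)
```

Applied to the real `y_test`, the same expression would show how much the model improves on guessing the majority class.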

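Next we build the data frame that the model will be applied to.

We start by loading the April 2018 FIFA rankings and the dataset of group-stage fixtures. Since World Cup matches have no "home" or "away" team, we treat the team ranked higher by FIFA as the "favourite" and place it in the home_team column. We then add the teams to a new prediction dataset according to each team's ranking. The next step is to create dummy variables and deploy the machine learning model.

April 2018 FIFA rankings: https://us.soccerway.com/teams/rankings/fifa/?ICID=TN_03_05_01

Group-stage fixtures dataset: https://fixturedownload.com/results/fifa-world-cup-2018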

    # Adding FIFA rankings.
    # The team positioned higher in the FIFA ranking is considered the "favourite" for the match
    # and is therefore placed in the "home_team" column,
    # since there are no "home" or "away" teams in World Cup games.

    # Loading new datasets
    ranking = pd.read_csv('C:/Coding/FIFA2018-World-cup/datasets/fifa_rankings.csv')
    fixtures = pd.read_csv('C:/Coding/FIFA2018-World-cup/datasets/fixtures.csv')

    # List for storing the group stage games
    pred_set = []

    # Create new columns with the ranking position of each team
    fixtures.insert(1, 'first_position', fixtures['Home Team'].map(ranking.set_index('Team')['Position']))
    fixtures.insert(2, 'second_position', fixtures['Away Team'].map(ranking.set_index('Team')['Position']))

    # We only need the group stage games, so we slice the dataset
    fixtures = fixtures.iloc[:48, :]

    # Loop to add teams to the new prediction dataset based on the ranking position of each team
    for index, row in fixtures.iterrows():
        if row['first_position'] < row['second_position']:
            pred_set.append({'home_team': row['Home Team'], 'away_team': row['Away Team'], 'winning_team': None})
        else:
            pred_set.append({'home_team': row['Away Team'], 'away_team': row['Home Team'], 'winning_team': None})

    pred_set = pd.DataFrame(pred_set)
    backup_pred_set = pred_set

    # Get dummy variables
    pred_set = pd.get_dummies(pred_set, prefix=['home_team', 'away_team'], columns=['home_team', 'away_team'])

    # Add the columns that are missing compared with the model's training dataset
    missing_cols = set(final.columns) - set(pred_set.columns)
    for c in missing_cols:
        pred_set[c] = 0
    pred_set = pred_set[final.columns]

    # Remove winning team column
    pred_set = pred_set.drop(['winning_team'], axis=1)
    pred_set.head()
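The alignment step matters because the model expects exactly the columns it was trained on, in the same order. As a hedged alternative to adding the missing columns one by one, `DataFrame.reindex` can do the alignment in a single call. The column list and fixture below are hypothetical stand-ins for `final.columns` and the real fixtures.

```python
import pandas as pd

# Hypothetical training columns (the real ones come from `final`, minus the label).
train_columns = ["home_team_Brazil", "home_team_Spain",
                 "away_team_Peru", "away_team_Spain"]

# One invented fixture to encode.
toy_fixtures = pd.DataFrame({"home_team": ["Brazil"], "away_team": ["Peru"]})
encoded = pd.get_dummies(toy_fixtures, columns=["home_team", "away_team"])

# Reorder to the training layout; columns absent from `encoded` are filled with 0,
# and any column not in `train_columns` would be dropped.
aligned = encoded.reindex(columns=train_columns, fill_value=0)
print(aligned)
```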

 

First, we deploy the model on the group-stage matches:

    # Group matches
    predictions = logreg.predict(pred_set)
    # Compute the probabilities once; columns follow logreg.classes_ (0, 1, 2)
    probabilities = logreg.predict_proba(pred_set)
    for i in range(fixtures.shape[0]):
        # Look the teams up by column name rather than by column position
        home = backup_pred_set.loc[i, 'home_team']
        away = backup_pred_set.loc[i, 'away_team']
        print(home + " and " + away)
        if predictions[i] == 2:
            print("Winner: " + home)
        elif predictions[i] == 1:
            print("Draw")
        elif predictions[i] == 0:
            print("Winner: " + away)
        print('Probability of ' + home + ' winning: ', '%.3f' % (probabilities[i][2]))
        print('Probability of Draw: ', '%.3f' % (probabilities[i][1]))
        print('Probability of ' + away + ' winning: ', '%.3f' % (probabilities[i][0]))
        print("")

Output:

    Russia and Saudi Arabia
    Winner: Russia
    Probability of Russia winning: 0.667
    Probability of Draw: 0.223
    Probability of Saudi Arabia winning: 0.111

    Uruguay and Egypt
    Winner: Uruguay
    Probability of Uruguay winning: 0.583
    Probability of Draw: 0.352
    Probability of Egypt winning: 0.065

    Iran and Morocco
    Draw
    Probability of Iran winning: 0.217
    Probability of Draw: 0.407
    Probability of Morocco winning: 0.376

    Portugal and Spain
    Winner: Spain
    Probability of Portugal winning: 0.302
    Probability of Draw: 0.344
    Probability of Spain winning: 0.354

    France and Australia
    Winner: France
    Probability of France winning: 0.628
    Probability of Draw: 0.227
    Probability of Australia winning: 0.145

    Argentina and Iceland
    Winner: Argentina
    Probability of Argentina winning: 0.803
    Probability of Draw: 0.161
    Probability of Iceland winning: 0.036

    Peru and Denmark
    Winner: Peru
    Probability of Peru winning: 0.439
    Probability of Draw: 0.171
    Probability of Denmark winning: 0.391

    Croatia and Nigeria
    Winner: Croatia
    Probability of Croatia winning: 0.590
    Probability of Draw: 0.258
    Probability of Nigeria winning: 0.152

    Costa Rica and Serbia
    Winner: Serbia
    Probability of Costa Rica winning: 0.315
    Probability of Draw: 0.324
    Probability of Serbia winning: 0.361

    Germany and Mexico
    Winner: Germany
    Probability of Germany winning: 0.567
    Probability of Draw: 0.282
    Probability of Mexico winning: 0.150

    Brazil and Switzerland
    Winner: Brazil
    Probability of Brazil winning: 0.775
    Probability of Draw: 0.138
    Probability of Switzerland winning: 0.087

    Sweden and Korea Republic
    Winner: Sweden
    Probability of Sweden winning: 0.503
    Probability of Draw: 0.329
    Probability of Korea Republic winning: 0.168

    Belgium and Panama
    Winner: Belgium
    Probability of Belgium winning: 0.765
    Probability of Draw: 0.145
    Probability of Panama winning: 0.090

    England and Tunisia
    Winner: England
    Probability of England winning: 0.649
    Probability of Draw: 0.292
    Probability of Tunisia winning: 0.059

    Colombia and Japan
    Winner: Colombia
    Probability of Colombia winning: 0.511
    Probability of Draw: 0.210
    Probability of Japan winning: 0.280

    Poland and Senegal
    Winner: Poland
    Probability of Poland winning: 0.612
    Probability of Draw: 0.223
    Probability of Senegal winning: 0.165

    Egypt and Russia
    Winner: Russia
    Probability of Egypt winning: 0.225
    Probability of Draw: 0.297
    Probability of Russia winning: 0.478

    Portugal and Morocco
    Winner: Portugal
    Probability of Portugal winning: 0.486
    Probability of Draw: 0.377
    Probability of Morocco winning: 0.138

    Uruguay and Saudi Arabia
    Winner: Uruguay
    Probability of Uruguay winning: 0.668
    Probability of Draw: 0.259
    Probability of Saudi Arabia winning: 0.073

    Spain and Iran
    Winner: Spain
    Probability of Spain winning: 0.695
    Probability of Draw: 0.247
    Probability of Iran winning: 0.058

    Denmark and Australia
    Winner: Denmark
    Probability of Denmark winning: 0.551
    Probability of Draw: 0.241
    Probability of Australia winning: 0.207

    France and Peru
    Winner: France
    Probability of France winning: 0.635
    Probability of Draw: 0.215
    Probability of Peru winning: 0.150

    Argentina and Croatia
    Winner: Argentina
    Probability of Argentina winning: 0.599
    Probability of Draw: 0.255
    Probability of Croatia winning: 0.146

    Brazil and Costa Rica
    Winner: Brazil
    Probability of Brazil winning: 0.800
    Probability of Draw: 0.147
    Probability of Costa Rica winning: 0.053

    Iceland and Nigeria
    Winner: Nigeria
    Probability of Iceland winning: 0.278
    Probability of Draw: 0.248
    Probability of Nigeria winning: 0.474

    Switzerland and Serbia
    Winner: Switzerland
    Probability of Switzerland winning: 0.402
    Probability of Draw: 0.228
    Probability of Serbia winning: 0.370

    Belgium and Tunisia
    Winner: Belgium
    Probability of Belgium winning: 0.619
    Probability of Draw: 0.253
    Probability of Tunisia winning: 0.128

    Mexico and Korea Republic
    Winner: Mexico
    Probability of Mexico winning: 0.504
    Probability of Draw: 0.327
    Probability of Korea Republic winning: 0.169

    Germany and Sweden
    Winner: Germany
    Probability of Germany winning: 0.571
    Probability of Draw: 0.228
    Probability of Sweden winning: 0.201

    England and Panama
    Winner: England
    Probability of England winning: 0.781
    Probability of Draw: 0.178
    Probability of Panama winning: 0.041

    Senegal and Japan
    Winner: Senegal
    Probability of Senegal winning: 0.397
    Probability of Draw: 0.278
    Probability of Japan winning: 0.325

    Poland and Colombia
    Draw
    Probability of Poland winning: 0.379
    Probability of Draw: 0.391
    Probability of Colombia winning: 0.230

    Uruguay and Russia
    Winner: Uruguay
    Probability of Uruguay winning: 0.403
    Probability of Draw: 0.388
    Probability of Russia winning: 0.209

    Egypt and Saudi Arabia
    Winner: Egypt
    Probability of Egypt winning: 0.544
    Probability of Draw: 0.216
    Probability of Saudi Arabia winning: 0.240

    Portugal and Iran
    Winner: Portugal
    Probability of Portugal winning: 0.548
    Probability of Draw: 0.353
    Probability of Iran winning: 0.099

    Spain and Morocco
    Winner: Spain
    Probability of Spain winning: 0.650
    Probability of Draw: 0.267
    Probability of Morocco winning: 0.083

    France and Denmark
    Winner: France
    Probability of France winning: 0.621
    Probability of Draw: 0.159
    Probability of Denmark winning: 0.220

    Peru and Australia
    Winner: Peru
    Probability of Peru winning: 0.463
    Probability of Draw: 0.250
    Probability of Australia winning: 0.288

    Argentina and Nigeria
    Winner: Argentina
    Probability of Argentina winning: 0.708
    Probability of Draw: 0.222
    Probability of Nigeria winning: 0.070

    Croatia and Iceland
    Winner: Croatia
    Probability of Croatia winning: 0.734
    Probability of Draw: 0.185
    Probability of Iceland winning: 0.080

    Mexico and Sweden
    Winner: Mexico
    Probability of Mexico winning: 0.465
    Probability of Draw: 0.264
    Probability of Sweden winning: 0.271

    Germany and Korea Republic
    Winner: Germany
    Probability of Germany winning: 0.598
    Probability of Draw: 0.282
    Probability of Korea Republic winning: 0.120

    Brazil and Serbia
    Winner: Brazil
    Probability of Brazil winning: 0.714
    Probability of Draw: 0.165
    Probability of Serbia winning: 0.120

    Switzerland and Costa Rica
    Winner: Switzerland
    Probability of Switzerland winning: 0.587
    Probability of Draw: 0.213
    Probability of Costa Rica winning: 0.200

    Poland and Japan
    Winner: Poland
    Probability of Poland winning: 0.551
    Probability of Draw: 0.242
    Probability of Japan winning: 0.206

    Colombia and Senegal
    Winner: Colombia
    Probability of Colombia winning: 0.577
    Probability of Draw: 0.194
    Probability of Senegal winning: 0.229

    Tunisia and Panama
    Winner: Tunisia
    Probability of Tunisia winning: 0.631
    Probability of Draw: 0.257
    Probability of Panama winning: 0.113

    Belgium and England
    Winner: England
    Probability of Belgium winning: 0.273
    Probability of Draw: 0.235
    Probability of England winning: 0.492
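The printing code reads the probabilities by column index, which relies on `predict_proba` returning its columns in the order of the model's `classes_` attribute. The sketch below checks that on a tiny invented dataset (`X_toy` and `y_toy` are hypothetical) and builds an explicit label-to-column mapping.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# A tiny, made-up one-feature dataset with the three labels 0, 1 and 2.
X_toy = np.array([[0.0], [0.2], [1.0], [1.2], [2.0], [2.2]])
y_toy = np.array([0, 0, 1, 1, 2, 2])

model = LogisticRegression().fit(X_toy, y_toy)

# One row of probabilities per sample, one column per entry of model.classes_.
proba = model.predict_proba(X_toy[:1])

# Map each label to its probability column instead of hard-coding positions.
column_of = {int(label): idx for idx, label in enumerate(model.classes_)}
print(model.classes_, proba, column_of)
```

With integer labels 0, 1 and 2 the mapping is the identity, which is why indexing with `[2]`, `[1]` and `[0]` works in the article's code.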

Next comes the simulation of the round of 16:

    # Round-of-16 pairings as a list of tuples
    group_16 = [('Uruguay', 'Portugal'),
                ('France', 'Croatia'),
                ('Brazil', 'Mexico'),
                ('England', 'Colombia'),
                ('Spain', 'Russia'),
                ('Argentina', 'Peru'),
                ('Germany', 'Switzerland'),
                ('Poland', 'Belgium')]

    def clean_and_predict(matches, ranking, final, logreg):

        # Retrieve each team's position according to the FIFA ranking
        positions = []
        for match in matches:
            positions.append(ranking.loc[ranking['Team'] == match[0], 'Position'].iloc[0])
            positions.append(ranking.loc[ranking['Team'] == match[1], 'Position'].iloc[0])

        # Build the rows of the prediction DataFrame:
        # the better-ranked team of each pair becomes the 'home' team, and vice versa
        pred_set = []
        for j, match in enumerate(matches):
            if positions[2 * j] < positions[2 * j + 1]:
                pred_set.append({'home_team': match[0], 'away_team': match[1]})
            else:
                pred_set.append({'home_team': match[1], 'away_team': match[0]})

        # Convert list into DataFrame
        pred_set = pd.DataFrame(pred_set)
        backup_pred_set = pred_set

        # Get dummy variables
        pred_set = pd.get_dummies(pred_set, prefix=['home_team', 'away_team'], columns=['home_team', 'away_team'])

        # Add the columns that are missing compared with the model's training dataset
        missing_cols2 = set(final.columns) - set(pred_set.columns)
        for c in missing_cols2:
            pred_set[c] = 0
        pred_set = pred_set[final.columns]

        # Remove winning team column
        pred_set = pred_set.drop(['winning_team'], axis=1)

        # Predict!
        predictions = logreg.predict(pred_set)
        # Compute the probabilities once; columns follow logreg.classes_ (0, 1, 2)
        probabilities = logreg.predict_proba(pred_set)
        for i in range(len(pred_set)):
            # Look the teams up by column name rather than by column position
            home = backup_pred_set.loc[i, 'home_team']
            away = backup_pred_set.loc[i, 'away_team']
            print(home + " and " + away)
            if predictions[i] == 2:
                print("Winner: " + home)
            elif predictions[i] == 1:
                print("Draw")
            elif predictions[i] == 0:
                print("Winner: " + away)
            print('Probability of ' + home + ' winning: ', '%.3f' % (probabilities[i][2]))
            print('Probability of Draw: ', '%.3f' % (probabilities[i][1]))
            print('Probability of ' + away + ' winning: ', '%.3f' % (probabilities[i][0]))
            print("")

    clean_and_predict(group_16, ranking, final, logreg)

Output:

    Portugal and Uruguay
    Winner: Portugal
    Probability of Portugal winning: 0.428
    Probability of Draw: 0.285
    Probability of Uruguay winning: 0.287

    France and Croatia
    Winner: France
    Probability of France winning: 0.481
    Probability of Draw: 0.252
    Probability of Croatia winning: 0.267

    Brazil and Mexico
    Winner: Brazil
    Probability of Brazil winning: 0.695
    Probability of Draw: 0.209
    Probability of Mexico winning: 0.096

    England and Colombia
    Winner: England
    Probability of England winning: 0.516
    Probability of Draw: 0.368
    Probability of Colombia winning: 0.116

    Spain and Russia
    Winner: Spain
    Probability of Spain winning: 0.529
    Probability of Draw: 0.280
    Probability of Russia winning: 0.191

    Argentina and Peru
    Winner: Argentina
    Probability of Argentina winning: 0.713
    Probability of Draw: 0.212
    Probability of Peru winning: 0.075

    Germany and Switzerland
    Winner: Germany
    Probability of Germany winning: 0.672
    Probability of Draw: 0.192
    Probability of Switzerland winning: 0.137

    Belgium and Poland
    Winner: Belgium
    Probability of Belgium winning: 0.513
    Probability of Draw: 0.202
    Probability of Poland winning: 0.285

We then simulate the quarter-finals, semi-finals, and final in turn:

Quarter-finals:

    # List of matches
    quarters = [('Portugal', 'France'),
                ('Spain', 'Argentina'),
                ('Brazil', 'England'),
                ('Germany', 'Belgium')]
    clean_and_predict(quarters, ranking, final, logreg)

Output:

    Portugal and France
    Winner: Portugal
    Probability of Portugal winning: 0.437
    Probability of Draw: 0.256
    Probability of France winning: 0.307

    Argentina and Spain
    Winner: Argentina
    Probability of Argentina winning: 0.518
    Probability of Draw: 0.262
    Probability of Spain winning: 0.220

    Brazil and England
    Winner: Brazil
    Probability of Brazil winning: 0.525
    Probability of Draw: 0.216
    Probability of England winning: 0.260

    Germany and Belgium
    Winner: Germany
    Probability of Germany winning: 0.563
    Probability of Draw: 0.269
    Probability of Belgium winning: 0.167

Semi-finals:

    # List of matches
    semi = [('Portugal', 'Brazil'),
            ('Argentina', 'Germany')]
    clean_and_predict(semi, ranking, final, logreg)

Output:

    Brazil and Portugal
    Winner: Brazil
    Probability of Brazil winning: 0.705
    Probability of Draw: 0.152
    Probability of Portugal winning: 0.143

    Germany and Argentina
    Winner: Germany
    Probability of Germany winning: 0.441
    Probability of Draw: 0.264
    Probability of Argentina winning: 0.295

Final:

    # Finals
    finals = [('Brazil', 'Germany')]
    clean_and_predict(finals, ranking, final, logreg)

Output:

    Germany and Brazil
    Winner: Brazil
    Probability of Germany winning: 0.359
    Probability of Draw: 0.220
    Probability of Brazil winning: 0.421

According to this model, Brazil is likely to win this World Cup.
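In the simulation above, each round's match list was typed in by hand from the previous round's winners. A small helper could build the next round automatically. This is a sketch under the assumption that the winners are listed in bracket order, so that neighbours meet in the next round; `next_round` is a hypothetical helper, not part of the original code.

```python
def next_round(winners):
    """Pair neighbouring winners: (1st, 2nd), (3rd, 4th), and so on."""
    if len(winners) % 2 != 0:
        raise ValueError("a knockout round needs an even number of teams")
    return [(winners[i], winners[i + 1]) for i in range(0, len(winners), 2)]


# Predicted round-of-16 winners, in the order the matches were listed above.
round16_winners = ["Portugal", "France", "Brazil", "England",
                   "Spain", "Argentina", "Germany", "Belgium"]
quarter_finals = next_round(round16_winners)

# Predicted quarter-final winners, kept in the same bracket order.
semi_finals = next_round(["Portugal", "Brazil", "Argentina", "Germany"])
print(quarter_finals)
print(semi_finals)
```

The resulting pairings are the same matches as the hand-written `quarters` and `semi` lists, although the quarter-finals come out in a different order.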

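Areas for further research and improvement:

  • To improve the quality of the dataset, FIFA game data could be used to assess the level of each player;
  • A confusion matrix can help us analyse which matches the model predicted incorrectly;
  • We could try combining several models to improve prediction accuracy.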
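As a starting point for the confusion-matrix idea, here is a minimal sketch with scikit-learn's `confusion_matrix`. The true and predicted labels are invented; with the real model they would be `y_test` and `logreg.predict(X_test)`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = away win, 1 = draw, 2 = home win.
y_true = [2, 2, 1, 0, 2, 1, 0, 2]
y_pred = [2, 2, 2, 0, 2, 0, 0, 1]

# Rows are the true labels, columns the predicted labels, both ordered 0, 1, 2.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)
```

In this toy case the middle column is almost empty and neither real draw was predicted as one. That is the kind of pattern to look for in a three-way match predictor.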

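Copyright notice: this is an original article by jlutiger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Article link: https://www.cnblogs.com/jlutiger/p/9184914.html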