Machine Learning with MATLAB 1.1 to 2.2
Published: 2019-04-30



1.1 Course Overview

1.2 Review - Machine Learning Onramp

Two files contain data for a selection of basketball players:

  • bballPlayers.txt: information about each player, such as position, height, and weight.
  • bballStats.txt: player statistics for each year, such as games played, points scored, and rebounds.

In this lesson, you will:

  • Import and format the data stored in the files.
  • Group statistics by player and merge the data sets into a single table.
  • Visualize various player statistics and explore features.
  • Train a classification model to predict a player’s position.
  • Evaluate the model’s performance.

1.2.1 Import Data

positions = ["G","G-F","F-G","F","F-C","C-F","C"]
Task 1

To bring data from a file named dataFile.txt into MATLAB as a table named T, you can use the readtable function.

playerInfo = readtable("bballPlayers.txt")
Task 2

When text labels are intended to represent a finite set of possibilities, such as a player’s position, it’s more suitable to store the data as a categorical array.

You can use the categorical function to convert an array of text labels to a categorical array.

playerInfo.pos = categorical(playerInfo.pos)	%convert an array of text to a categorical array.

The function categories returns a list of all possible categories in a categorical array.

categories(playerInfo.pos)	%output: ["F-C-G","F-G-C","G","G-F","F-G","F","F-C","C-F","C"]
Task 3

Sometimes you may want to define the categories for your categorical data yourself, for example when you are pulling data from multiple sources and not all of the categories are represented in a particular set.

You can use a string array of category names as a second input to the categorical function to specify the categories.

playerInfo.pos = categorical(playerInfo.pos,positions)

Use the categories defined in the string array positions to convert the data in playerInfo.pos into a categorical.

categories(playerInfo.pos)	%output:["G","G-F","F-G","F","F-C","C-F","C"]

Other classifications are forced to convert to <undefined>.
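A tiny hypothetical example of this behavior (made-up labels, not from the course files):

```matlab
% Labels outside the specified category set become <undefined>.
labels = ["G" "F" "X"];                 % "X" is not one of the valid positions
c = categorical(labels,["G" "F" "C"])   % c = G  F  <undefined>
isundefined(c)                          % logical row vector: 0  0  1
```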

Task 4

Any data not specified by a category in positions becomes <undefined>.

To remove rows from a table which contain an undefined or missing value, you can use the rmmissing function.

playerInfo = rmmissing(playerInfo)
Task 5
allStats = readtable("bballStats.txt")
Task 6

You can remove rows or variables from a table by first selecting those rows or variables and then assigning to them an empty array.

For example, the following command removes everything after the 18th column in allStats.

allStats(:,19:end) = []

1.2.3 Group and Merge Data

Task 1

The groupsummary function performs grouped calculations. For example, the following command calculates the standard deviation (std) of the data in data, grouped by data.Label.

stdevData = groupsummary(data,"Label","std")

Create a table named playerStats which calculates the sum of the data in allStats grouped by "playerID".

You may leave off the semicolon to view the output.

playerStats = groupsummary(allStats,"playerID","sum")
Task 2

The table playerStats has variable names similar to those in allStats, but prepended with sum_. It also has an additional variable named GroupCount.

You can access the variable names in a table using the VariableNames property of the Properties of the table.

Remember you can remove a variable from a table by assigning it the empty array [].

table.Properties.VariableNames
table.variable = []

Remove the variable GroupCount from playerStats. Then replace the variable names in playerStats with the variable names in allStats.

playerStats.GroupCount = [];
playerStats.Properties.VariableNames = allStats.Properties.VariableNames
Task 3

You can combine, or join, two tables by matching up rows with the same key variable values. The key variable playerID can be used to join the data from playerInfo and playerStats.

The innerjoin function joins two tables, and includes only observations whose key variable values appear in both tables.

Create a table named data which joins playerInfo and playerStats, and includes only players who appear in both tables.

data = innerjoin(playerInfo,playerStats)
Further Practice
>> Join Tables

The Join Tables Live Editor task lets you perform the join interactively in a visual window.

1.2.4 Explore Data

Task 1

You can plot some results from the basketball player data to explore relationships between various statistics and player position.

Create a box plot of player height for each position.

boxplot(data.height,data.pos)
ylabel("Height (inches)")

Task 2

It looks as though guards (G) are generally shorter than forwards (F) or centers (C). What other patterns can we find in the data?

You can use gscatter to explore the relationship between two variables, grouped by position.

Plot points against rebounds, grouped by position.

gscatter(data.rebounds,data.points,data.pos)

Task 3

You can see some groupings for different positions. However, the data’s range makes it difficult to compare players, since most players are clustered tightly around the origin.

Some players played much more than others, leading to higher total points and rebounds. One way to account for this difference is to divide points and rebounds by the number of games played.

In the table data, the variable GP contains the number of games played. You can use element-wise division (./) to calculate per game statistics.

Plot points per game against rebounds per game, grouped by player position.

gscatter(data.rebounds./data.GP,data.points./data.GP,data.pos)

Task 4

The data points are no longer clustered around the origin, but the data points for each position are still spread out. Can a different normalization yield more insight?

Another way to account for difference in play time is to divide by the number of minutes played, data.minutes.

Plot points per minute against rebounds per minute, grouped by player position.

gscatter(data.rebounds./data.minutes,data.points./data.minutes,data.pos)

1.2.5 Train a Model and Make Predictions

Task 1

The normalized (per minute) numeric statistics from the basketball player data set have been divided into a training set dataTrain and a testing set dataTest. You will train a classification model using the training set, then make predictions for the testing set.

A k-nearest neighbor (kNN) model classifies an observation as the same class as the nearest known examples. You can fit a kNN model by passing a table of data to the fitcknn function.

The second input is the name of the response variable in the table (that is, the variable you want the model to predict). The output is a variable containing the fitted model.

Fit a kNN model to the data stored in dataTrain. The known classes are the player positions, stored in the variable named "pos". Store the fitted model in a variable called knnmodel.

knnmodel = fitcknn(dataTrain,"pos")
Task 2

The predict function determines the predicted class of new observations.

The inputs are the trained model and a table of observations, with the same predictor variables as were used to train the model. The output is a categorical array of the predicted class for each observation in the new data.

Predict the positions for the data in dataTest. Store the predictions in a variable called predPos.

predPos = predict(knnmodel,dataTest)
Task 3

How well did the kNN model predict player position?

A commonly-used metric to evaluate a model is the misclassification rate (the proportion of incorrect predictions). This metric is also called the model’s loss.

You can use the loss function to calculate the misclassification rate for a data set.

Calculate the misclassification rate for dataTest, and assign the result to the variable mdlLoss.

mdlLoss = loss(knnmodel,dataTest)
output : mdlLoss = 0.63224826

The loss value calculated by the loss function can differ slightly from the misclassification rate computed by hand:

allwrong = sum(predPos ~= dataTest.pos)
rate = allwrong / numel(predPos)

output : rate = 0.63186813
Task 4

The loss value indicates that over 60% of the positions were predicted incorrectly. Did the model misclassify some positions more than others?

A confusion matrix gives the number of observations from each class that are predicted to be each class. It’s commonly visualized by shading the elements according to their value, with the diagonal elements (the correct classifications) shaded in one color and the other elements (the incorrect classifications) in another color. You can visualize a confusion matrix using the confusionchart function.

confusionchart(ytrue,ypred);

ytrue is a vector of the known classes and ypred is a vector of the predicted classes.

The table dataTest contains the known player positions, which you can compare with the predicted positions, predPos.

Use the confusionchart function to compare predPos to the known labels (stored in dataTest.pos).

confusionchart(dataTest.pos,predPos)

1.2.6 Evaluate the Model and Iterate

Task 1

By default, fitcknn fits a kNN model with k = 1. That is, the model uses the class of the single closest “neighbor” to classify a new observation.

The model’s performance may improve if the value of k is increased – that is, it uses the most common class of several neighbors, instead of just one.
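The majority-vote idea can be illustrated with a toy sketch (hypothetical data; this is only the concept, not how fitcknn is implemented internally):

```matlab
% Toy sketch of majority-vote kNN: classify a query point by the most
% common class among its k nearest training observations.
Xtrain = [1 1; 1 2; 5 5; 6 5; 6 6];              % training observations
ytrain = categorical(["G";"G";"C";"C";"C"]);     % known classes
query  = [5.5 5.5];                              % new observation to classify
k = 3;
[~,idx] = mink(vecnorm(Xtrain - query,2,2),k);   % rows of the k nearest neighbors
pred = mode(ytrain(idx))                         % majority class among them -> C
```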

You can change the value of k by setting the "NumNeighbors" property when calling fitcknn.

mdl = fitcknn(table,"ResponseVariable", ...    "NumNeighbors",7)

Modify the fitcknn function call on line 3. Set the "NumNeighbors" property to 5.

knnmodel = fitcknn(dataTrain,"pos","NumNeighbors",5);
Task 2

Using 5 nearest neighbors reduced the loss, but the model still misclassifies over 50% of the test data set.

Many machine learning methods use the distance between observations as a similarity measure. Smaller distances indicate more similar observations.

In the basketball data set, the statistics have different units and scales, which means some statistics will contribute more than others to the distance calculation. Centering and scaling each statistic makes them contribute more evenly.

By setting the "Standardize" property to true in the fitcknn function, each column of predictor data is normalized to have mean 0 and standard deviation 1, then the model is trained using the standardized data.
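What "Standardize",true does to each predictor column can be sketched as a z-score transform (a hand-rolled version for illustration; fitcknn performs this internally):

```matlab
% Center and scale each column: subtract its mean, divide by its standard deviation.
X = [10 200; 20 400; 30 600];
Xstd = (X - mean(X)) ./ std(X);   % each column now has mean 0 and std 1
% The normalize function gives the same result: normalize(X)
```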

Modify line 3 again. Add to the fitcknn function call to also set the "Standardize" property to true.

knnmodel = fitcknn(dataTrain,"pos","NumNeighbors",5,"Standardize",true);

1.2.7 Course Quick Reference

2.1 Course Example - Grouping Basketball Players

2.2 Low Dimensional Visualization

2.2.3 Multidimensional Scaling

Key terms in this section: dimension, approximate representation, principal component analysis (PCA), classical multidimensional scaling, orthogonal coordinate system, Euclidean distance, Manhattan distance (city block), eigenvalues.

2.2.4 Classical Multidimensional Scaling

Task 1 Calculate pairwise distances

The matrix X has 4 columns, and therefore would be best visualized using 4 dimensions. In this activity, you will use multidimensional scaling to visualize the data using fewer dimensions, while retaining most of the information it contains.

load data
whos X

  Name      Size      Bytes  Class     Attributes
  X         124x4      3968  double

To perform multidimensional scaling, you must first calculate the pairwise distances between observations. You can use the pdist function to calculate these distances.

distances = pdist(data,"distance");

Calculate the pairwise distances between rows of X, and name the result D.

D = pdist(X)	%equivalently pdist(X,"euclidean")



The output D is a distance, or dissimilarity, vector containing the distance between each pair of observations.

D has length 124*(124-1)/2 = 7626.

The input X is a 124-by-4 numeric matrix containing the data. Each of the 124 rows is considered an observation.

The optional second input specifies the method of calculating the distance or dissimilarity. Commonly used methods are:

"euclidean"	%(default)
"cityblock"
"correlation"

Task 2 Perform multidimensional scaling

The cmdscale function finds a configuration matrix and its corresponding eigenvalues for a set of pairwise distances.

[configMat,eigVal] = cmdscale(distances);

Find the configuration matrix and its eigenvalues for the distance matrix D, and name them Y and e, respectively.

[Y,e] = cmdscale(D)



cmdscale stands for Classical MultiDimensional Scaling.

The input D is a distance or dissimilarity vector.

The output Y is a 124-by-q matrix of the reconstructed coordinates in q-dimensional space.

q is the minimum number of dimensions needed to achieve the given pairwise distances.

e contains the eigenvalues of the matrix Y*Y'.


You can use the eigenvalues e to determine whether a low-dimensional approximation to the points in Y provides a reasonable representation of the data. If the first p eigenvalues are significantly larger than the rest, the points are well approximated by the first p dimensions (that is, the first p columns of Y).
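One way to sanity-check such an approximation (a sketch, assuming the D and Y obtained from the cmdscale call above) is to recompute pairwise distances from the first p columns of Y and compare them to the original distances:

```matlab
% If the first 3 dimensions capture the structure, distances computed from
% Y(:,1:3) should closely match the original pairwise distances in D.
maxRelErr = max(abs(D - pdist(Y(:,1:3)))) / max(D)   % near zero => good fit
```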

Task 3 Visualizes relative magnitudes of vector

You can use the pareto function to create a Pareto chart, which visualizes relative magnitudes of a vector in descending order.

pareto(vector)

Create a Pareto chart of the eigenvalues e.

pareto(e)

In this result, the first 3 elements of e are distinctly larger than the rest, so we can retain the first three dimensions without losing too much information from the raw data.


Task 4 Create a scatter plot of the first two columns of a matrix M

From the Pareto chart, you can see that over 90% of the distribution is described with just two variables.

You can use the scatter function to create a scatter plot of the first two columns of a matrix M.

scatter(M(:,1),M(:,2))

Use scatter to create a scatter plot of the first two columns of Y.

scatter(Y(:,1),Y(:,2))
Task 5 Create a three-dimensional scatter plot

From the Pareto chart, notice that 100% of the distribution is described with three variables.

The scatter3 function creates a three-dimensional scatter plot. You can use scatter3 to create a scatter plot of three columns of a matrix M.

scatter3(M(:,1),M(:,2),M(:,3))

Use scatter3 to create a scatter plot of the first three columns of Y.

scatter3(Y(:,1),Y(:,2),Y(:,3))

Different view of the three-dimensional scatter plot.

2.2.5 Nonclassical Multidimensional Scaling

When you use the cmdscale function, it determines how many dimensions are returned in the configuration matrix.

To find a configuration matrix with a specified number of dimensions, you can use the mdscale function.

In fact, cmdscale also accepts a second input specifying the number of dimensions to return, as in cmdscale(D,2).

configMat = mdscale(distances,numDims);

Calculate the pairwise distance between rows of X, and name it D. Then find the configuration matrix in 2 dimensions of the distances and name it Y.

load data
whos X

D = pdist(X)
Y = cmdscale(D,2)
% Y = mdscale(D,2)
scatter(Y(:,1),Y(:,2))


2.2.6 Principal Component Analysis (PCA)

Another commonly used method for dimensionality reduction is principal component analysis (PCA). Use the function pca to perform principal component analysis.

[pcs,scrs,~,~,pexp] = pca(data)

pca takes the raw observations as input, whereas cmdscale and mdscale require a vector of pairwise distances as input.
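A small sketch of the connection between the two methods (my own check, not part of the course): with Euclidean distances, classical MDS recovers the same coordinates as PCA, up to sign flips of the columns.

```matlab
% Classical MDS on Euclidean distances matches PCA scores up to column signs.
rng(0)
X = randn(20,4);
[~,scrs] = pca(X);
Y = cmdscale(pdist(X));
max(abs(abs(Y) - abs(scrs(:,1:size(Y,2)))),[],"all")   % essentially zero
```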

load data
whos X

[pcs,scrs,~,~,pexp] = pca(X)
pareto(pexp)
scatter(scrs(:,1),scrs(:,2))

2.2.10 Basketball Players

The statsNorm variable contains numeric statistics for several basketball players, normalized to have mean 0 and standard deviation 1.

Use classical multidimensional (CMD) scaling to find the reconstructed coordinates and corresponding eigenvalues for the data in statsNorm. Plot the Pareto chart of the eigenvalues.

%This code loads and formats the data.
data = readtable("bball.txt");
data.pos = categorical(data.pos);

%This code extracts and normalizes the columns of interest.
stats = data{:,[5 6 11:end]};	% extract columns 5, 6, and 11 onward of 'data' into 'stats'
statsNorm = normalize(stats);	% normalize the data

% Task 1
D = pdist(statsNorm)
[Y,e] = cmdscale(D)
pareto(e)
scatter3(Y(:,1),Y(:,2),Y(:,3))
view(100,50)	% change the view of the three-dimensional plot

% Task 2
[~,scores,~,~,explained] = pca(statsNorm)
pareto(explained)
scatter3(scores(:,1),scores(:,2),scores(:,3))
view(100,50)

% Task 3
scatter3(Y(:,1),Y(:,2),-Y(:,3),10,data.pos)	% the third coordinate is negated
c = colorbar;	% add a colorbar to the figure
c.TickLabels = categories(data.pos);
scatter3(scores(:,1),scores(:,2),-scores(:,3),10,data.pos)
c = colorbar;
c.TickLabels = categories(data.pos);


the scatter plot of the CMD values


the scatter plot of the PCA values

