
Google ML Series Video Notes

The Google ML series currently consists of seven short videos; the content is accessible and easy to follow.

Lesson 1 Hello World

from sklearn import tree

# features: [weight in grams, texture (1 = smooth, 0 = bumpy)]
features = [[140, 1], [130, 1], [150, 0], [170, 0]]
# labels: 0 = apple, 1 = orange
labels = [0, 0, 1, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
# predict() expects a 2D array: a list of examples
print clf.predict([[150, 0]])

Lesson 2 Visualizing a Decision Tree

Why decision trees?

  • easy to read and understand

Iris (Wiki)

  • A classic ML problem: recognizing the species of a flower
  • Four features, three labels
  • Loaded directly from sklearn
# inspect the data
from sklearn.datasets import load_iris
iris = load_iris()
print iris.feature_names
print iris.target_names
print iris.data[0]
print iris.target[0]
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
['setosa' 'versicolor' 'virginica']
[ 5.1  3.5  1.4  0.2]
0
# build a decision tree classifier
import numpy as np
from sklearn.datasets import load_iris
from sklearn import tree
iris = load_iris()
# test-set indices: hold out one example of each class
test_idx = [0, 50, 100]

# training data
train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)

# testing data
test_target = iris.target[test_idx]
test_data = iris.data[test_idx]

clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)

print test_target
print clf.predict(test_data)
[0 1 2]
[0 1 2]

Visualization

Using pydotplus

  • pip install pydotplus
  • conda install graphviz

GraphViz’s executables not found

  • Download Graphviz
  • Install it and note the install path, e.g. C:\Program Files (x86)\Graphviz2.38\bin
  • Add that path to the system PATH environment variable
  • Restart the IDE
from IPython.display import Image
import pydotplus

dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png(), width=500, height=500)

(Figure: the rendered decision tree for the iris classifier)
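Outside a notebook, the rendered tree can also be written straight to disk; a minimal sketch using pydotplus's write_png (the filename is arbitrary):

# save the rendered tree as an image file instead of displaying it inline
graph.write_png('iris_tree.png')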

Lesson 3 What Makes a Good Feature

  • More informative features generally train a better model
  • Duplicate features should be removed; otherwise the classifier uses the same feature more than once, over-weighting it
  • The more uniformly a feature is distributed across classes, the less it contributes to classification
  • Features should be independent of one another
  • Features should be preprocessed; e.g., latitude/longitude pairs can be converted into a distance (see the sketch after the histogram below)
  • In summary, features should be:
    • informative
    • independent
    • simple
import numpy as np
import matplotlib.pyplot as plt

# simulate 500 greyhounds and 500 labradors
greyhounds = 500
labs = 500

# heights in inches: greyhounds average 28, labs average 24, both with std 4
grey_height = 28 + 4 * np.random.randn(greyhounds)
lab_height = 24 + 4 * np.random.randn(labs)

plt.hist([grey_height, lab_height], stacked=True, color=['r', 'b'])
plt.show()

(Figure: stacked height histogram; greyhounds in red, labs in blue)
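As a hedged illustration of the preprocessing bullet above, here is a minimal sketch that turns two (lat, lon) pairs into a great-circle distance via the haversine formula (the city coordinates are just example values, not from the video):

import math

def haversine(lat1, lon1, lat2, lon2):
    # great-circle distance in kilometers between two (lat, lon) points
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# e.g. collapse a (user_location, store_location) pair into one distance feature
print(haversine(39.90, 116.40, 31.23, 121.47))  # Beijing -> Shanghai, ~1070 km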

Lesson 4 Let’s Write a Pipeline

  • Split the data into a training set and a test set; train on the former, validate on the latter
  • Use sklearn.cross_validation.train_test_split to split the dataset (moved to sklearn.model_selection in newer versions)
  • Essentially, we are learning a function from features to labels, i.e., from input to output
  • Neural network demo: playground
# import a dataset
from sklearn import datasets
iris = datasets.load_iris()

X = iris.data
y = iris.target

# split (sklearn.model_selection.train_test_split in newer versions)
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
tree_clf = tree.DecisionTreeClassifier()
kn_clf = KNeighborsClassifier()

tree_clf.fit(X_train, y_train)
kn_clf.fit(X_train, y_train)

tree_pred = tree_clf.predict(X_test)
kn_pred = kn_clf.predict(X_test)

from sklearn.metrics import accuracy_score
print 'tree_clf accuracy:', accuracy_score(y_test, tree_pred)
print 'kn_clf accuracy:', accuracy_score(y_test, kn_pred)
tree_clf accuracy: 0.946666666667
kn_clf accuracy: 0.986666666667

Lesson 5 Writing Our First Classifier

A simple random classifier: with three balanced classes, its accuracy should hover around 1/3.

import random

class random_clf():
    def fit(self, X_train, y_train):
        # just memorize the training data
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_test):
        # guess a random training label for every test row
        predictions = []
        for row in X_test:
            label = random.choice(self.y_train)
            predictions.append(label)
        return predictions
# import a dataset
from sklearn import datasets
iris = datasets.load_iris()

X = iris.data
y = iris.target

# split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

clf = random_clf()
clf.fit(X_train, y_train)
clf_pred = clf.predict(X_test)

from sklearn.metrics import accuracy_score
print 'accuracy:', accuracy_score(y_test, clf_pred)
accuracy: 0.36

KNN (K-Nearest Neighbors)

  • Look at the K training points nearest to the test point; the class holding the majority among those K points is the prediction
  • Distance metric: Euclidean distance, i.e., the square root of the sum of squared differences
  • Notes:
    • When implementing, fix the interface first (fit, predict)
    • For each method, pin down its inputs and outputs before writing the body
  • K = 1 below; a majority-vote sketch for a general K follows the accuracy output
from scipy.spatial import distance

# Euclidean distance between points a and b
def euc(a, b):
    return distance.euclidean(a, b)

class knn_clf():
    def fit(self, X_train, y_train):
        # memorize the training data
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_test):
        predictions = []
        for row in X_test:
            label = self.closest(row)
            predictions.append(label)
        return predictions

    def closest(self, row):
        # linear scan for the single nearest training example (K = 1)
        best_dist = euc(row, self.X_train[0])
        best_index = 0
        for i in range(1, len(self.X_train)):
            dist = euc(row, self.X_train[i])
            if dist < best_dist:
                best_dist = dist
                best_index = i
        return self.y_train[best_index]

# import a dataset
from sklearn import datasets
iris = datasets.load_iris()

X = iris.data
y = iris.target

# split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

clf = knn_clf()
clf.fit(X_train, y_train)
clf_pred = clf.predict(X_test)

from sklearn.metrics import accuracy_score
print 'accuracy:', accuracy_score(y_test, clf_pred)
accuracy: 0.986666666667
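The closest() method above hard-codes K = 1. As promised, a hedged sketch of what a general-K method with majority voting could look like (k_closest is a hypothetical replacement for closest(), meant to live inside knn_clf; it is not from the video):

from collections import Counter

def k_closest(self, row, k=3):
    # pair every training example's distance with its label
    dists = [(euc(row, self.X_train[i]), self.y_train[i])
             for i in range(len(self.X_train))]
    # sort by distance and keep the k nearest labels
    dists.sort(key=lambda d: d[0])
    k_labels = [label for _, label in dists[:k]]
    # majority vote among the k nearest neighbors
    return Counter(k_labels).most_common(1)[0][0]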

Lesson 6 Train an Image Classifier with TensorFlow for Poets

No feature engineering needed!!!

Data

  • Images of five flower species, 218 MB
  • To classify other image categories, just create a new folder and put 100+ images of that category in it

Diversity and quantity

  • Diversity: the more varied the samples, the better the classifier generalizes to new examples
  • Quantity: the more samples, the stronger the classifier

The following code was run on Linux.

from sklearn import datasets, cross_validation
# tensorflow 1.1.0
import tensorflow as tf

# load the dataset
tf.logging.set_verbosity(tf.logging.ERROR)  # silence everything below ERROR
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = cross_validation.train_test_split(iris.data, iris.target, test_size=0.2)

# Construct a DNN

# Specify that all features have real-valued data (4 dimensions)
feature_columns = [tf.contrib.layers.real_valued_column('', dimension=4)]

# Build a 3-layer DNN with 10, 20 and 10 units respectively
classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[10, 20, 10],
                                            n_classes=3,  # three flower classes
                                            model_dir='/tmp/iris_model')

classifier.fit(x=X_train, y=y_train, steps=1000)
score = classifier.evaluate(x=X_test, y=y_test, steps=1)['accuracy']
print('\nTest Accuracy: {0:f}\n'.format(score))
Test Accuracy: 1.000000

Lesson 7 Classifying Handwritten Digits with TF.Learn

The MNIST problem

  • the Hello World of computer vision
  • 55,000 training images and 10,000 test images; each image is preprocessed into a 28×28 matrix, i.e., 784 features
  • a ten-class classification problem
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import tensorflow as tf
learn = tf.contrib.learn
tf.logging.set_verbosity(tf.logging.ERROR)

Import the dataset

mnist = learn.datasets.load_dataset('mnist')
data = mnist.train.images
labels = np.asarray(mnist.train.labels, dtype=np.int32)
test_data = mnist.test.images
test_labels = np.asarray(mnist.test.labels, dtype=np.int32)

# use a smaller subset of the data
max_examples = 10000
data = data[:max_examples]
labels = labels[:max_examples]
Extracting MNIST-data/train-images-idx3-ubyte.gz
Extracting MNIST-data/train-labels-idx1-ubyte.gz
Extracting MNIST-data/t10k-images-idx3-ubyte.gz
Extracting MNIST-data/t10k-labels-idx1-ubyte.gz

Display an example

def display(i):
    img = test_data[i]
    plt.title('Example %d. Label: %d' % (i, test_labels[i]))
    plt.imshow(img.reshape((28, 28)), cmap=plt.cm.gray_r)

display(0)
print 'number of features is', len(data[0])
number of features is 784

(Figure: an example MNIST digit rendered in grayscale)

Fit a linear classifier

feature_columns = learn.infer_real_valued_columns_from_input(data)
classifier = learn.LinearClassifier(feature_columns=feature_columns, n_classes=10)
classifier.fit(data, labels, batch_size=100, steps=1000)

# evaluate once and print the accuracy
print classifier.evaluate(test_data, test_labels)['accuracy']
0.9137

Visualize learned weights

weights = classifier.weights_
f, axes = plt.subplots(2, 5, figsize=(10, 4))
axes = axes.reshape(-1)
for i in range(len(axes)):
    a = axes[i]
    a.imshow(weights.T[i].reshape(28, 28), cmap=plt.cm.seismic)
    a.set_title(i)
    a.set_xticks(())  # ticks be gone
    a.set_yticks(())
plt.show()
---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-32-13532f014713> in <module>()
----> 1 weights = classifier.weights_
      2 f, axes = plt.subplots(2, 5, figsize=(10,4))
      3 axes = axes.reshape(-1)
      4 for i in range(len(axes)):
      5     a = axes[i]


AttributeError: 'LinearClassifier' object has no attribute 'weights_'
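The weights_ property was removed from tf.contrib.learn estimators in TF 1.x, which is why the call above fails. A hedged workaround: list the estimator's variables and fetch the weight matrix by name (the exact name varies across TF versions, so check the printed list first):

# inspect which variables the trained estimator actually holds
print(classifier.get_variable_names())

# fetch the linear weights by name; 'linear//weight' is the name TF 1.x
# typically reports for this model, but verify it against the list above
weights = classifier.get_variable_value('linear//weight')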

Appendix: code analysis (from ahangchen)


  • Download the dataset
1
mnist = learn.datasets.load_dataset('mnist')

Yes, it's that simple: one line of code downloads and unpacks the MNIST data. Each image is already flattened into a length-784 grayscale array, and each label is the digit class as an integer (0-9).
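A quick way to confirm those shapes (assuming the mnist object loaded above; the exact label values will vary):

print(mnist.train.images.shape)   # (55000, 784): flattened 28x28 images
print(mnist.train.labels.shape)   # (55000,): one label per image
print(mnist.train.labels[:10])    # integer class ids, not one-hot vectors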

  • Load the images into memory with numpy for later use, both the training set (only the first 10,000 examples are kept) and the test set
data = mnist.train.images
labels = np.asarray(mnist.train.labels, dtype=np.int32)
test_data = mnist.test.images
test_labels = np.asarray(mnist.test.labels, dtype=np.int32)
max_examples = 10000
data = data[:max_examples]
labels = labels[:max_examples]
  • Visualize an image
def display(i):
    img = test_data[i]
    plt.title('Example %d. Label: %d' % (i, test_labels[i]))
    plt.imshow(img.reshape((28, 28)), cmap=plt.cm.gray_r)
    plt.show()

matplotlib renders the grayscale image.

  • Train the classifier
    • Extract the features (here, each image's features are just its 784 pixel values)
feature_columns = learn.infer_real_valued_columns_from_input(data)
  • Create a linear classifier and train it
classifier = learn.LinearClassifier(feature_columns=feature_columns, n_classes=10)
classifier.fit(data, labels, batch_size=100, steps=1000)

Note that n_classes must be set to the number of label classes.

  • The classifier is essentially estimating, from each feature, how likely each label is
  • Some features matter more than others, so each one needs its own weight
  • The weights start out random; fitting is really the process of adjusting them

  • The label with the highest score becomes the prediction (see the sketch after this list)
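In equation form, the model scores every class as y = xW + b and predicts the argmax. A minimal numpy sketch of that forward pass (W and b here are random stand-ins for the learned parameters, not the trained values):

import numpy as np

W = np.random.randn(784, 10)    # stand-in for the trained weights
b = np.random.randn(10)         # stand-in for the trained biases

x = np.random.rand(784)         # one flattened 28x28 image

scores = x.dot(W) + b           # one score per digit class
prediction = np.argmax(scores)  # the highest-scoring class wins
print(prediction)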

  • Feed in the test set, predict, and evaluate the classification performance
result = classifier.evaluate(test_data, test_labels)
print result["accuracy"]

It runs very fast, and the accuracy reaches 91.4%.

You can also predict a single image and check whether the prediction matches the actual picture:

# here's one it gets right
print ("Predicted %d, Label: %d" % (classifier.predict(test_data[0]), test_labels[0]))
display(0)
# and one it gets wrong
print ("Predicted %d, Label: %d" % (classifier.predict(test_data[8]), test_labels[8]))
display(8)
  • Visualize the weights to understand how the classifier works
weights = classifier.weights_
a.imshow(weights.T[i].reshape(28, 28), cmap=plt.cm.seismic)

  • This shows, for each of the ten weight images (one per digit), the weight of every pixel (i.e., every feature)
  • Red marks positive weights, blue marks negative weights
  • The more a pixel contributes, the deeper its color, i.e., the larger its weight
  • So the red regions of each weight image roughly trace out the corresponding digit