The main statistical application of singular value decomposition (SVD) is principal component analysis (PCA). The features of a dataset (whose importance is measured by the singular values in the SVD) are ordered by importance; dimensionality reduction is the process of discarding the unimportant singular vectors, and the space spanned by the remaining singular vectors is the reduced space.
(Excerpted from the Wikipedia article on SVD)
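As a minimal sketch of that PCA-via-SVD connection (the data matrix `X` and component count `k` here are illustrative, not part of the assignment): center the data, take the SVD, and project onto the leading right singular vectors, which are the principal directions.

```python
import numpy as np

# Center the data, then SVD: the rows of Vt are the principal directions,
# and the singular values s encode each direction's importance (descending).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # hypothetical data: 200 samples, 5 features
Xc = X - X.mean(axis=0)                # center each feature
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                                  # keep the top-2 principal components
X_reduced = Xc @ Vt[:k].T              # project onto the top-k directions
print(X_reduced.shape)                 # (200, 2)
```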
Assignment 1 - Big Data Processing and Visual Analytics
Problem 1 (10 points)
- Randomly construct a 100*1000 user-merchant sparse matrix A (think about how to make the generated matrix sparse).
- Use SVD to decompose the matrix A, solve for the U and V matrices, and compute the reduced matrix for r=10.
- Pick two rows i and j of A such that the sets of nonzero entries of Ai. and Aj. have an empty intersection, and compute the similarity of i and j in the low-dimensional space.
Randomly generate the sparse matrix A
In [ ]:
import scipy.sparse as sparse

m = 100
n = 1000
density = 0.05
matrixformat = 'coo'  # sparse matrix storage format
# sparse.rand fills only about density * m * n entries, which keeps the matrix sparse
B = sparse.rand(m, n, density=density, format=matrixformat, dtype=None)
B = B.todense()
B.shape
Out[ ]:
(100, 1000)
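To answer the "how do we make the generated matrix sparse?" question concretely: `scipy.sparse.rand` stores only about `density * m * n` nonzero entries, so the sparsity can be verified directly from `nnz`. A small check (reusing the parameters above):

```python
import scipy.sparse as sparse

# sparse.rand samples roughly density * m * n positions to fill;
# all other entries are implicit zeros, never stored.
m, n, density = 100, 1000, 0.05
B = sparse.rand(m, n, density=density, format='coo')
actual = B.nnz / (m * n)   # fraction of stored (nonzero) entries
print(B.nnz, actual)       # roughly density * m * n nonzeros
```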
SVD (singular value decomposition)
In [ ]:
from scipy import linalg
U,sigma,V = linalg.svd(B)
# U
sigma
# V
Out[ ]:
array([8.93964862, 5.48278381, 5.36655726, 5.3255837 , 5.21364595,
5.20737242, 5.10192193, 5.0620928 , 5.03008488, 4.97668339,
4.97266981, 4.95176028, 4.91282338, 4.87533471, 4.82759205,
4.78706258, 4.7544839 , 4.74378353, 4.70796851, 4.66225737,
4.64391281, 4.61614613, 4.57444229, 4.56855915, 4.54699385,
4.48385541, 4.47729552, 4.43210337, 4.42078991, 4.39144925,
4.3679347 , 4.35384228, 4.3243931 , 4.27869585, 4.25504224,
4.21917624, 4.20960625, 4.20224166, 4.16219185, 4.14033402,
4.11690205, 4.09735693, 4.07396685, 4.04315248, 4.02525479,
4.014447 , 3.99821387, 3.976753 , 3.96253811, 3.925201 ,
3.89705301, 3.87543386, 3.84105746, 3.81917748, 3.8131703 ,
3.78832872, 3.77909794, 3.75941617, 3.71438735, 3.69457749,
3.66015594, 3.62733751, 3.61016161, 3.59135862, 3.56015115,
3.5504101 , 3.54236261, 3.50401097, 3.48236387, 3.47850091,
3.46173718, 3.40493574, 3.39865938, 3.35621324, 3.33774056,
3.3193311 , 3.31584103, 3.25276965, 3.24629126, 3.23813756,
3.19274254, 3.19061905, 3.15012911, 3.11179983, 3.10692166,
3.07249756, 3.03811826, 3.02004638, 2.99165758, 2.98369651,
2.94127266, 2.89840336, 2.884009 , 2.83756567, 2.81545674,
2.76210203, 2.71557685, 2.71316961, 2.67810644, 2.65285722])
The reduced matrix for r=10: V_10_T
In [ ]:
print(V.shape)
# scipy.linalg.svd returns V transposed (Vh), so its first 10 rows are the
# top-10 right singular vectors; transposing gives a 1000 x 10 projection matrix.
V_10_T = V[:10, :].T
print(V_10_T.shape)
(1000, 1000)
(1000, 10)
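Keeping only r=10 components discards the singular values beyond the 10th. A sketch (with a synthetic matrix standing in for the notebook's B) of how the quality of the rank-10 approximation follows from those discarded values, via the Eckart-Young theorem:

```python
import numpy as np
from scipy import linalg

# Synthetic sparse-like matrix standing in for B (values are illustrative).
rng = np.random.default_rng(0)
B = rng.random((100, 1000)) * (rng.random((100, 1000)) < 0.05)
U, sigma, Vt = linalg.svd(B, full_matrices=False)

r = 10
B_r = U[:, :r] @ np.diag(sigma[:r]) @ Vt[:r]   # best rank-r approximation of B
# Eckart-Young: the Frobenius error equals the root of the sum of squared
# discarded singular values sigma_{r+1}, ..., sigma_n.
err = np.linalg.norm(B - B_r)
print(np.isclose(err, np.sqrt(np.sum(sigma[r:] ** 2))))  # True
```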
Find two vectors that are completely dissimilar before dimensionality reduction, then compute their cosine similarity after the reduction
In [ ]:
import numpy as np

def find_disjoint_rows():
    '''Find two distinct rows of B whose nonzero entries do not overlap.'''
    for i in range(100):
        for j in range(i + 1, 100):
            # Entries of B are nonnegative, so a zero dot product
            # implies the supports of the two rows are disjoint.
            if B[i] * B[j].T == 0:
                return i, j
    return None

def cosSim(x, y):
    '''Cosine similarity between two row vectors.'''
    tmp = np.sum(x * y.T)
    non = np.linalg.norm(x) * np.linalg.norm(y)
    return np.round(tmp / float(non), 9)

i, j = find_disjoint_rows()
Bi_d = B[i] * V_10_T  # project row i into the 10-dimensional space
Bj_d = B[j] * V_10_T
print(Bi_d, Bj_d)
print(cosSim(Bi_d, Bj_d))
[[ 0.95642839 -0.27502353 -0.835077    0.35242053 -0.01606326  0.30347212 -0.47576853  0.01046348 -0.1825462   0.78894515]]
[[ 0.66067271  0.29407694  0.32758199  0.04380316 -0.0195309  -0.14866655  0.33761032  0.51772913  0.42221464 -0.65366609]]
-0.233686123
As we can see, two vectors that were completely dissimilar (orthogonal) before the reduction acquire a nonzero similarity after it: projecting onto the shared 10-dimensional latent space mixes their coordinates.
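This effect does not depend on the particular matrix above. A minimal self-contained sketch (with a random projection standing in for V_10_T): two vectors with disjoint supports are exactly orthogonal in the original space, yet their low-dimensional projections generally are not.

```python
import numpy as np

# Two vectors with disjoint supports: exactly orthogonal in the original space.
rng = np.random.default_rng(0)
x = np.zeros(1000); x[:500] = rng.random(500)   # nonzeros only in the first half
y = np.zeros(1000); y[500:] = rng.random(500)   # nonzeros only in the second half
assert x @ y == 0                               # disjoint supports -> similarity 0

# Random 1000 x 10 basis standing in for V_10_T (illustrative assumption).
V10 = rng.normal(size=(1000, 10))
xp, yp = x @ V10, y @ V10
cos = (xp @ yp) / (np.linalg.norm(xp) * np.linalg.norm(yp))
print(cos)                                      # generally nonzero after projection
```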