The main statistical application of singular value decomposition (SVD) is principal component analysis (PCA). The features of a dataset (whose importance is measured by the singular values in the SVD) are ordered by importance; dimensionality reduction is the process of discarding the unimportant singular vectors, and the space spanned by the remaining singular vectors is the reduced space.
(Excerpted from the Wikipedia article on SVD)
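As a minimal sketch of that PCA-via-SVD connection (the data matrix `X` and component count `k` here are illustrative, not part of the assignment): center the data, take the SVD, and project onto the leading right singular vectors, which are the principal directions.

```python
import numpy as np

# Center the data, then SVD: the rows of Vt are the principal directions,
# and the singular values s encode each direction's importance (descending).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # hypothetical data: 200 samples, 5 features
Xc = X - X.mean(axis=0)                # center each feature
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                                  # keep the top-2 principal components
X_reduced = Xc @ Vt[:k].T              # project onto the top-k directions
print(X_reduced.shape)                 # (200, 2)
```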
Assignment 1 - Big Data Processing and Visual Analytics
Problem 1 (10 points)
- Randomly construct a 100*1000 user-merchant sparse matrix A (think about how to make the generated matrix sparse).
- Use SVD to decompose the matrix A, solve for the U and V matrices, and compute the reduced matrix for r=10.
- Pick two rows i and j of A such that the sets of nonzero entries of Ai. and Aj. have an empty intersection, and compute the similarity of i and j in the low-dimensional space.
Randomly generate the sparse matrix A
In [ ]:
import scipy.sparse as sparse

m = 100
n = 1000
density = 0.05
matrixformat = 'coo'  # sparse matrix storage format
# sparse.rand fills only about density * m * n entries, which keeps the matrix sparse
B = sparse.rand(m, n, density=density, format=matrixformat, dtype=None)
B = B.todense()
B.shape
Out[ ]:
(100, 1000)
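To answer the "how do we make the generated matrix sparse?" question concretely: `scipy.sparse.rand` stores only about `density * m * n` nonzero entries, so the sparsity can be verified directly from `nnz`. A small check (reusing the parameters above):

```python
import scipy.sparse as sparse

# sparse.rand samples roughly density * m * n positions to fill;
# all other entries are implicit zeros, never stored.
m, n, density = 100, 1000, 0.05
B = sparse.rand(m, n, density=density, format='coo')
actual = B.nnz / (m * n)   # fraction of stored (nonzero) entries
print(B.nnz, actual)       # roughly density * m * n nonzeros
```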
SVD (singular value decomposition)
In [ ]:
from scipy import linalg
U,sigma,V = linalg.svd(B)
# U
sigma
# V
Out[ ]:
array([8.93964862, 5.48278381, 5.36655726, 5.3255837 , 5.21364595,
5.20737242, 5.10192193, 5.0620928 , 5.03008488, 4.97668339,
4.97266981, 4.95176028, 4.91282338, 4.87533471, 4.82759205,
4.78706258, 4.7544839 , 4.74378353, 4.70796851, 4.66225737,
4.64391281, 4.61614613, 4.57444229, 4.56855915, 4.54699385,
4.48385541, 4.47729552, 4.43210337, 4.42078991, 4.39144925,
4.3679347 , 4.35384228, 4.3243931 , 4.27869585, 4.25504224,
4.21917624, 4.20960625, 4.20224166, 4.16219185, 4.14033402,
4.11690205, 4.09735693, 4.07396685, 4.04315248, 4.02525479,
4.014447 , 3.99821387, 3.976753 , 3.96253811, 3.925201 ,
3.89705301, 3.87543386, 3.84105746, 3.81917748, 3.8131703 ,
3.78832872, 3.77909794, 3.75941617, 3.71438735, 3.69457749,
3.66015594, 3.62733751, 3.61016161, 3.59135862, 3.56015115,
3.5504101 , 3.54236261, 3.50401097, 3.48236387, 3.47850091,
3.46173718, 3.40493574, 3.39865938, 3.35621324, 3.33774056,
3.3193311 , 3.31584103, 3.25276965, 3.24629126, 3.23813756,
3.19274254, 3.19061905, 3.15012911, 3.11179983, 3.10692166,
3.07249756, 3.03811826, 3.02004638, 2.99165758, 2.98369651,
2.94127266, 2.89840336, 2.884009 , 2.83756567, 2.81545674,
2.76210203, 2.71557685, 2.71316961, 2.67810644, 2.65285722])
The reduced matrix for r=10: V_10_T
In [ ]:
print(V.shape)
# scipy.linalg.svd returns V transposed (Vh), so its first 10 rows are the
# top-10 right singular vectors; transposing gives a 1000 x 10 projection matrix.
V_10_T = V[:10, :].T
print(V_10_T.shape)
(1000, 1000)
(1000, 10)
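Keeping only r=10 components discards the singular values beyond the 10th. A sketch (with a synthetic matrix standing in for the notebook's B) of how the quality of the rank-10 approximation follows from those discarded values, via the Eckart-Young theorem:

```python
import numpy as np
from scipy import linalg

# Synthetic sparse-like matrix standing in for B (values are illustrative).
rng = np.random.default_rng(0)
B = rng.random((100, 1000)) * (rng.random((100, 1000)) < 0.05)
U, sigma, Vt = linalg.svd(B, full_matrices=False)

r = 10
B_r = U[:, :r] @ np.diag(sigma[:r]) @ Vt[:r]   # best rank-r approximation of B
# Eckart-Young: the Frobenius error equals the root of the sum of squared
# discarded singular values sigma_{r+1}, ..., sigma_n.
err = np.linalg.norm(B - B_r)
print(np.isclose(err, np.sqrt(np.sum(sigma[r:] ** 2))))  # True
```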
Find two vectors that are completely dissimilar before dimensionality reduction, then compute their cosine similarity after the reduction
In [ ]:
import numpy as np

def find_disjoint_rows():
    '''Find two distinct rows of B whose nonzero entries do not overlap.'''
    for i in range(100):
        for j in range(i + 1, 100):
            # Entries of B are nonnegative, so a zero dot product
            # implies the supports of the two rows are disjoint.
            if B[i] * B[j].T == 0:
                return i, j
    return None

def cosSim(x, y):
    '''Cosine similarity between two row vectors.'''
    tmp = np.sum(x * y.T)
    non = np.linalg.norm(x) * np.linalg.norm(y)
    return np.round(tmp / float(non), 9)

i, j = find_disjoint_rows()
Bi_d = B[i] * V_10_T  # project row i into the 10-dimensional space
Bj_d = B[j] * V_10_T
print(Bi_d, Bj_d)
print(cosSim(Bi_d, Bj_d))
[[ 0.95642839 -0.27502353 -0.835077    0.35242053 -0.01606326  0.30347212 -0.47576853  0.01046348 -0.1825462   0.78894515]]
[[ 0.66067271  0.29407694  0.32758199  0.04380316 -0.0195309  -0.14866655  0.33761032  0.51772913  0.42221464 -0.65366609]]
-0.233686123
As we can see, two vectors that were completely dissimilar (orthogonal) before the reduction acquire a nonzero similarity after it: projecting onto the shared 10-dimensional latent space mixes their coordinates.
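This effect does not depend on the particular matrix above. A minimal self-contained sketch (with a random projection standing in for V_10_T): two vectors with disjoint supports are exactly orthogonal in the original space, yet their low-dimensional projections generally are not.

```python
import numpy as np

# Two vectors with disjoint supports: exactly orthogonal in the original space.
rng = np.random.default_rng(0)
x = np.zeros(1000); x[:500] = rng.random(500)   # nonzeros only in the first half
y = np.zeros(1000); y[500:] = rng.random(500)   # nonzeros only in the second half
assert x @ y == 0                               # disjoint supports -> similarity 0

# Random 1000 x 10 basis standing in for V_10_T (illustrative assumption).
V10 = rng.normal(size=(1000, 10))
xp, yp = x @ V10, y @ V10
cos = (xp @ yp) / (np.linalg.norm(xp) * np.linalg.norm(yp))
print(cos)                                      # generally nonzero after projection
```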