GPU UMAP#
UMAP is a popular dimensionality reduction method that helps make large, complex datasets tractable to analyze. Graphistry provides convenient bindings for cuml.UMAP.
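UMAP is:
* interested in the number of nearest neighbors
* non-linear, unlike long-standing methods such as PCA
* non-scaling, which keeps computation fast
* stochastic and therefore non-deterministic; different libraries handle this differently, as you will see in this notebook
  * umap-learn states that "variance between runs will exist, however small"
  * cuml currently uses "exact kNN". This may change in future releases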
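To make the "number of nearest neighbors" and "exact kNN" points concrete, here is a minimal sketch of a brute-force (exact) k-nearest-neighbor search in plain Python. The function name `exact_knn` and the toy points are made up for illustration; this is not how cuml or umap-learn implement the search, only the idea that every pairwise distance is computed rather than approximated.

```python
import math

def exact_knn(points, k):
    """Return, for each point, the indices of its k nearest other points (brute force)."""
    neighbors = []
    for i, p in enumerate(points):
        # Compare point i against every other point: O(n) per point, O(n^2) overall.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

# Toy data: two tight pairs plus one point close to the first pair.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.2)]
knn = exact_knn(points, k=2)
print(knn)
```

Because no approximation or random sampling is involved, this step returns the same neighbors on every run. The randomness in UMAP comes from other stages, such as initialization and optimization of the layout.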
Further reading:
Part 3: GPU SQL - deprecated, because Dask-SQL replaced BlazingSQL in the RAPIDS ecosystem
Clone and install graphistry, print version#
[9]:
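import pandas as pd
from time import time
# Uncomment to clone the repository on first run; the pip step below installs from that local checkout
# !git clone https://github.com/graphistry/pygraphistry.git
!pip install -U pygraphistry/ --quiet
import graphistry
# Replace the masked username and password with your own credentials
graphistry.register(api=3, protocol="https", server="hub.graphistry.com", username='***', password='***')
graphistry.__version__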
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[9]:
'0.27.2+4.ga674343.dirty'
[2]:
import pandas as pd, numpy as np
# Unix-second bounds for the random dates
start_u = pd.to_datetime('2016-01-01').value//10**9
end_u = pd.to_datetime('2021-01-01').value//10**9
samples=1000
# Synthetic user table: random age, user_id, profile, date and location
df = pd.DataFrame(np.random.randint(18,75,size=(samples, 1)), columns=['age'])
df['user_id'] = np.random.randint(0,200,size=samples)
df['profile'] = np.random.randint(0,1000,size=samples)
df['date']=pd.to_datetime(np.random.randint(start_u, end_u, samples), unit='s').date
# Note: 'lat' holds values in [110, 120) and 'lon' in [20, 24), so the two names look swapped;
# only the combined "lat,lon" string is kept below
df['lon']=np.round(np.random.uniform(20, 24,size=(samples)), 2)
df['lat']=np.round(np.random.uniform(110, 120,size=(samples)), 2)
df['location']=df['lat'].astype(str) +","+ df["lon"].astype(str)
df.drop(columns=['lat','lon'],inplace=True)
# Cast every column to string
df = df.astype(str)
df
[2]:
| | age | user_id | profile | date | location |
|---|---|---|---|---|---|
| 0 | 32 | 185 | 357 | 2017-06-16 | 117.81,22.87 |
| 1 | 66 | 86 | 84 | 2020-03-30 | 110.07,20.52 |
| 2 | 28 | 26 | 862 | 2019-05-12 | 116.16,23.02 |
| 3 | 69 | 193 | 607 | 2019-03-11 | 112.21,23.25 |
| 4 | 34 | 27 | 4 | 2019-08-06 | 114.56,20.99 |
| ... | ... | ... | ... | ... | ... |
| 995 | 52 | 128 | 435 | 2016-10-19 | 115.3,23.67 |
| 996 | 67 | 116 | 97 | 2016-04-24 | 117.69,23.92 |
| 997 | 32 | 55 | 915 | 2018-11-07 | 113.63,22.74 |
| 998 | 72 | 68 | 148 | 2020-05-23 | 116.39,21.25 |
| 999 | 56 | 19 | 932 | 2016-04-23 | 116.2,23.54 |
1000 rows × 5 columns
[3]:
g = graphistry.nodes(df)
t=time()
g2 = g.umap()
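elapsed_min=(time()-t)/60  # elapsed minutes; named to avoid shadowing the builtin min()
rows_per_min=df.shape[0]/elapsed_min
print(['time: '+str(elapsed_min)+' line/min: '+str(rows_per_min)])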
g2.plot()
! Failed umap speedup attempt. Continuing without memoization speedups.
* Ignoring target column of shape (1000, 0) in UMAP fit, as it is not one dimensional
['time: 0.14064184427261353 line/min: 7110.259433612426']
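The cells in this notebook repeat the same pattern: record a start time, run a call, then report elapsed minutes and rows per minute. A small helper can factor that out. The sketch below is one possible way to do it; `timed_rows_per_minute` is a hypothetical name, not part of graphistry, and the workload is a stand-in so the snippet runs without a GPU or a Graphistry account.

```python
from time import perf_counter

def timed_rows_per_minute(fn, n_rows):
    """Run fn() once and return (result, elapsed_minutes, rows_per_minute)."""
    start = perf_counter()
    result = fn()
    elapsed_min = (perf_counter() - start) / 60
    # Guard against a zero reading on very fast calls or coarse clocks.
    rows_per_min = n_rows / elapsed_min if elapsed_min > 0 else float("inf")
    return result, elapsed_min, rows_per_min

# Stand-in workload; in the notebook this would be something like `lambda: g.umap()`.
result, elapsed_min, rows_per_min = timed_rows_per_minute(
    lambda: sum(range(100_000)), n_rows=1000
)
print(['time: ' + str(elapsed_min) + ' line/min: ' + str(rows_per_min)])
```

`perf_counter` is used here instead of `time` because it is a monotonic, high-resolution clock, which matters when a run takes only a fraction of a second.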
Parameters: X and y, feature_engine, etc.#
[4]:
g = graphistry.nodes(df)
t=time()
g2 = g.umap(X=['user_id'],y=['date','location'])
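elapsed_min=(time()-t)/60  # elapsed minutes; named to avoid shadowing the builtin min()
rows_per_min=df.shape[0]/elapsed_min
print(['time: '+str(elapsed_min)+' line/min: '+str(rows_per_min)])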
g2.plot()
! Failed umap speedup attempt. Continuing without memoization speedups.
* Ignoring target column of shape (1000, 14) in UMAP fit, as it is not one dimensional
['time: 0.0287002166112264 line/min: 34842.94260026035']
[5]:
g = graphistry.nodes(df)
t=time()
g2 = g.umap(X=['user_id'],y=['date','location'], feature_engine='torch')
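elapsed_min=(time()-t)/60  # elapsed minutes; named to avoid shadowing the builtin min()
rows_per_min=df.shape[0]/elapsed_min
print(['time: '+str(elapsed_min)+' line/min: '+str(rows_per_min)])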
g2.plot()
! Failed umap speedup attempt. Continuing without memoization speedups.
* Ignoring target column of shape (1000, 14) in UMAP fit, as it is not one dimensional
['time: 0.0024895787239074705 line/min: 401674.38386140653']
Testing various other parameters
[6]:
g = graphistry.nodes(df)
t=time()
g2 = g.umap(X=['user_id'],y=['date','location'], feature_engine='torch', n_neighbors= 2,min_dist=.1, spread=.1, local_connectivity=2, n_components=5,metric='hellinger')
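elapsed_min=(time()-t)/60  # elapsed minutes; named to avoid shadowing the builtin min()
rows_per_min=df.shape[0]/elapsed_min
print(['time: '+str(elapsed_min)+' line/min: '+str(rows_per_min)])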
g2.plot(render=False)
! Failed umap speedup attempt. Continuing without memoization speedups.
* Ignoring target column of shape (1000, 14) in UMAP fit, as it is not one dimensional
['time: 0.0022179365158081056 line/min: 450869.5325013168']
Test the engine flag to see the speedup#
[7]:
g = graphistry.nodes(df)
t=time()
g2 = g.umap(engine='cuml')
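elapsed_min=(time()-t)/60  # elapsed minutes; named to avoid shadowing the builtin min()
rows_per_min=df.shape[0]/elapsed_min
print(['time: '+str(elapsed_min)+' line/min: '+str(rows_per_min)])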
! Failed umap speedup attempt. Continuing without memoization speedups.
* Ignoring target column of shape (1000, 0) in UMAP fit, as it is not one dimensional
['time: 0.00446544885635376 line/min: 223941.65338544376']
[8]:
g = graphistry.nodes(df)
t=time()
g2 = g.umap(engine='umap_learn') ## note this will take appreciable time depending on sample count defined above
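elapsed_min=(time()-t)/60  # elapsed minutes; named to avoid shadowing the builtin min()
rows_per_min=df.shape[0]/elapsed_min
print(['time: '+str(elapsed_min)+' line/min: '+str(rows_per_min)])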
* Ignoring target column of shape (1000, 0) in UMAP fit, as it is not one dimensional
['time: 0.11818180878957113 line/min: 8461.539134001174']
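The two printed timings above can be turned into a speedup figure with a little arithmetic. The snippet below copies the elapsed minutes from the `engine='cuml'` and `engine='umap_learn'` outputs; the variable names are illustrative. Treat the result as indicative only: each engine was timed once, on 1000 synthetic rows.

```python
# Elapsed minutes copied from the two printed outputs above (one run each, 1000 rows).
cuml_minutes = 0.00446544885635376
umap_learn_minutes = 0.11818180878957113

speedup = umap_learn_minutes / cuml_minutes
print(f"cuml: {cuml_minutes * 60:.2f} s, umap_learn: {umap_learn_minutes * 60:.2f} s")
print(f"speedup: {speedup:.1f}x")
```

On these numbers the cuml run is roughly 26 times faster than the umap_learn run. A more careful comparison would repeat each run several times and vary the sample count.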
Now let's look at some real data:#
[12]:
G=pd.read_csv('pygraphistry/demos/data/honeypot.csv')
g = graphistry.nodes(G)
t=time()
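g3 = g.umap(engine='cuml')  # change to engine='umap_learn' to compare against the CPU library
elapsed_min=(time()-t)/60  # elapsed minutes; named to avoid shadowing the builtin min()
rows_per_min=G.shape[0]/elapsed_min
print(['time: '+str(elapsed_min)+' line/min: '+str(rows_per_min)])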
! Failed umap speedup attempt. Continuing without memoization speedups.
* Ignoring target column of shape (220, 0) in UMAP fit, as it is not one dimensional
['time: 0.008098324139912924 line/min: 27166.11439590581']
[13]:
print(g3._edges.info())
g3._edges.sample(5)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2410 entries, 0 to 2821
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 _src_implicit 2410 non-null int32
1 _dst_implicit 2410 non-null int32
2 _weight 2410 non-null float32
dtypes: float32(1), int32(2)
memory usage: 47.1 KB
None
[13]:
| | _src_implicit | _dst_implicit | _weight |
|---|---|---|---|
| 671 | 51 | 123 | 0.017956 |
| 2123 | 167 | 194 | 0.663975 |
| 1761 | 139 | 78 | 0.113361 |
| 2444 | 191 | 3 | 0.999991 |
| 2441 | 190 | 152 | 0.544303 |
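The `_weight` column gives each inferred edge a strength; in the sample above the values fall between 0 and 1. A common follow-up is to keep only the stronger edges before plotting. The sketch below does this in plain Python on the five sampled rows, so it runs without the dataset. The cutoff `MIN_WEIGHT` is a hypothetical value chosen for illustration, not a recommended setting.

```python
# The five sampled rows shown above, as (_src_implicit, _dst_implicit, _weight) tuples.
sampled_edges = [
    (51, 123, 0.017956),
    (167, 194, 0.663975),
    (139, 78, 0.113361),
    (191, 3, 0.999991),
    (190, 152, 0.544303),
]

MIN_WEIGHT = 0.5  # hypothetical cutoff, for illustration only

# Keep edges at or above the cutoff, strongest first.
strong_edges = [e for e in sampled_edges if e[2] >= MIN_WEIGHT]
strong_edges.sort(key=lambda e: e[2], reverse=True)

for src, dst, weight in strong_edges:
    print(f"{src} -> {dst}  weight={weight:.3f}")
```

Since `g3._edges` is a pandas DataFrame in this run, the same filter on the full edge list would be ordinary boolean indexing, for example `g3._edges[g3._edges['_weight'] >= MIN_WEIGHT]`.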
[16]:
#g3.plot()
Next steps#
Part 3: GPU SQL - deprecated, because Dask-SQL replaced BlazingSQL in the RAPIDS ecosystem