GPU UMAP

UMAP is a popular dimensionality reduction method that helps make sense of large, complex datasets. Graphistry provides convenient bindings for cuml.UMAP.

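UMAP is:

* sensitive to the number of nearest neighbors
* non-linear, unlike long-established methods such as PCA
* non-scaling, which keeps computation fast
* stochastic and therefore non-deterministic; different libraries handle this differently, as you will see in this notebook
* umap-learn states that "variance between runs will exist, however small"
* cuml currently uses "exact kNN". This may change in future releases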
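To make the "exact kNN" point concrete, here is a minimal pure-Python sketch of brute-force nearest-neighbor search: every point is compared with every other point, so the result is exact rather than approximate. It is for intuition only and is not how cuml or umap-learn implement their search; the function name `exact_knn` and the toy points are made up for this example.

```python
import math

def exact_knn(points, k):
    """Return, for each point, the indices of its k nearest other points.

    Brute force: all pairwise Euclidean distances are computed, so the
    answer is exact, at O(n^2) cost.
    """
    neighbors = []
    for i, p in enumerate(points):
        # Distance from point i to every other point j
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()  # ties are broken by the smaller index
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

# Two tight clusters: indices 0-2 near the origin, 3-4 far away
toy_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0)]
knn = exact_knn(toy_points, k=2)
print(knn)
```

With a small `k`, each point's first neighbor stays inside its own cluster, while a larger `k` forces neighbors to be drawn from the other cluster as well. This is one way to see why the neighbor count matters for the resulting embedding.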

Further reading:

Part I: CPU Baseline in Python Pandas
Part II: GPU DataFrames with RAPIDS Python cudf bindings
Part III: GPU SQL - deprecated, as Dask-SQL replaced BlazingSQL in the RAPIDS ecosystem
Part IV: GPU ML with RAPIDS cuML UMAP and PyGraphistry
Graphistry cuGraph bindings

Clone and install graphistry, print version

[9]:

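import pandas as pd
# Uncomment to fetch the source tree; the pip step below installs from this local clone
# !git clone https://github.com/graphistry/pygraphistry.git

from time import time
!pip install -U pygraphistry/ --quiet

import graphistry
# Replace the masked username and password with your own Graphistry Hub credentials
graphistry.register(api=3, protocol="https", server="hub.graphistry.com", username='***', password='***')
graphistry.__version__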

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[9]:

'0.27.2+4.ga674343.dirty'
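Hard-coding credentials in a notebook is easy to leak, so one option is to read them from environment variables. The sketch below only builds the keyword arguments used in the `graphistry.register` call above. The variable names `DEMO_GRAPHISTRY_USERNAME` and `DEMO_GRAPHISTRY_PASSWORD` and the helper `graphistry_register_kwargs` are hypothetical, chosen for this example.

```python
import os

def graphistry_register_kwargs(env=os.environ):
    """Build keyword arguments for graphistry.register from environment variables.

    DEMO_GRAPHISTRY_USERNAME and DEMO_GRAPHISTRY_PASSWORD are hypothetical
    names chosen for this sketch.
    """
    required = ("DEMO_GRAPHISTRY_USERNAME", "DEMO_GRAPHISTRY_PASSWORD")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "api": 3,
        "protocol": "https",
        "server": "hub.graphistry.com",
        "username": env["DEMO_GRAPHISTRY_USERNAME"],
        "password": env["DEMO_GRAPHISTRY_PASSWORD"],
    }

# Usage, assuming graphistry is installed and both variables are set:
# import graphistry
# graphistry.register(**graphistry_register_kwargs())
```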

[2]:

import pandas as pd, numpy as np
# Unix-second bounds for the random dates
start_u = pd.to_datetime('2016-01-01').value//10**9
end_u = pd.to_datetime('2021-01-01').value//10**9
samples=1000
df = pd.DataFrame(np.random.randint(18,75,size=(samples, 1)), columns=['age'])
df['user_id'] = np.random.randint(0,200,size=samples)
df['profile'] = np.random.randint(0,1000,size=samples)
df['date']=pd.to_datetime(np.random.randint(start_u, end_u, samples), unit='s').date

# Longitude spans 110-120 and latitude 20-24; location is stored as "lon,lat"
df['lon']=np.round(np.random.uniform(110, 120,size=samples), 2)
df['lat']=np.round(np.random.uniform(20, 24,size=samples), 2)
df['location']=df['lon'].astype(str) +","+ df["lat"].astype(str)
df.drop(columns=['lat','lon'],inplace=True)
df = df.astype(str)  # every column as strings; astype avoids the deprecated applymap
df

[2]:

	age	user_id	profile	date	location
0	32	185	357	2017-06-16	117.81,22.87
1	66	86	84	2020-03-30	110.07,20.52
2	28	26	862	2019-05-12	116.16,23.02
3	69	193	607	2019-03-11	112.21,23.25
4	34	27	4	2019-08-06	114.56,20.99
...	...	...	...	...	...
995	52	128	435	2016-10-19	115.3,23.67
996	67	116	97	2016-04-24	117.69,23.92
997	32	55	915	2018-11-07	113.63,22.74
998	72	68	148	2020-05-23	116.39,21.25
999	56	19	932	2016-04-23	116.2,23.54

1000 rows × 5 columns
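The cell above draws fresh random values on every run, so the table differs each time. If repeatable input is wanted, a seeded generator is one option. The sketch below uses Python's standard `random` module rather than NumPy and mirrors the column layout above with string-valued fields; the function name `make_rows` and the seed values are hypothetical.

```python
import random
from datetime import date, timedelta

def make_rows(samples, seed=0):
    """Generate reproducible synthetic rows shaped like the table above."""
    rng = random.Random(seed)  # private generator: same seed, same rows
    first_day = date(2016, 1, 1)
    n_days = (date(2021, 1, 1) - first_day).days
    rows = []
    for _ in range(samples):
        lon = round(rng.uniform(110, 120), 2)
        lat = round(rng.uniform(20, 24), 2)
        rows.append({
            "age": str(rng.randrange(18, 75)),       # 18..74 inclusive
            "user_id": str(rng.randrange(0, 200)),
            "profile": str(rng.randrange(0, 1000)),
            "date": str(first_day + timedelta(days=rng.randrange(n_days))),
            "location": f"{lon},{lat}",
        })
    return rows

rows = make_rows(1000, seed=42)
print(rows[0])
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` to obtain a frame with the same five columns. Fixing the input does not by itself make the UMAP embedding deterministic.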

[3]:

g = graphistry.nodes(df)
t = time()
g2 = g.umap()  # all defaults
minutes = (time() - t) / 60  # elapsed wall-clock time in minutes
rows_per_min = df.shape[0] / minutes
print(['time: ' + str(minutes) + ' line/min: ' + str(rows_per_min)])
g2.plot()

! Failed umap speedup attempt. Continuing without memoization speedups.
* Ignoring target column of shape (1000, 0) in UMAP fit, as it is not one dimensional

['time: 0.14064184427261353 line/min: 7110.259433612426']
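The same timing pattern recurs in every cell below, so it can be factored into a small helper. This is a minimal sketch, not part of graphistry: the name `timed` is hypothetical, and it uses `time.perf_counter`, a monotonic clock intended for measuring intervals, instead of `time.time`.

```python
from time import perf_counter

def timed(fn, n_rows):
    """Run fn() and return (result, minutes elapsed, rows per minute)."""
    start = perf_counter()
    result = fn()
    minutes = (perf_counter() - start) / 60
    # Guard against a zero reading on very fast calls
    rows_per_min = n_rows / minutes if minutes > 0 else float("inf")
    return result, minutes, rows_per_min

# Stand-in workload; in this notebook it would be e.g. lambda: g.umap()
result, minutes, rows_per_min = timed(lambda: sum(range(100_000)), n_rows=1000)
print([f"time: {minutes} line/min: {rows_per_min}"])
```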

Parameters: `X` and `y`, `feature_engine`, etc.

[4]:

g = graphistry.nodes(df)
t = time()
# X selects the feature columns, y the target columns
g2 = g.umap(X=['user_id'], y=['date', 'location'])
minutes = (time() - t) / 60  # elapsed wall-clock time in minutes
rows_per_min = df.shape[0] / minutes
print(['time: ' + str(minutes) + ' line/min: ' + str(rows_per_min)])
g2.plot()

! Failed umap speedup attempt. Continuing without memoization speedups.
* Ignoring target column of shape (1000, 14) in UMAP fit, as it is not one dimensional

['time: 0.0287002166112264 line/min: 34842.94260026035']

[5]:

g = graphistry.nodes(df)
t = time()
# Same columns as before, now with feature_engine='torch'
g2 = g.umap(X=['user_id'], y=['date', 'location'], feature_engine='torch')
minutes = (time() - t) / 60  # elapsed wall-clock time in minutes
rows_per_min = df.shape[0] / minutes
print(['time: ' + str(minutes) + ' line/min: ' + str(rows_per_min)])
g2.plot()

! Failed umap speedup attempt. Continuing without memoization speedups.
* Ignoring target column of shape (1000, 14) in UMAP fit, as it is not one dimensional

['time: 0.0024895787239074705 line/min: 401674.38386140653']

Testing various other parameters

[6]:

g = graphistry.nodes(df)
t = time()
g2 = g.umap(
    X=['user_id'], y=['date', 'location'], feature_engine='torch',
    n_neighbors=2, min_dist=.1, spread=.1, local_connectivity=2,
    n_components=5, metric='hellinger',
)
minutes = (time() - t) / 60  # elapsed wall-clock time in minutes
rows_per_min = df.shape[0] / minutes
print(['time: ' + str(minutes) + ' line/min: ' + str(rows_per_min)])
g2.plot(render=False)

! Failed umap speedup attempt. Continuing without memoization speedups.
* Ignoring target column of shape (1000, 14) in UMAP fit, as it is not one dimensional

['time: 0.0022179365158081056 line/min: 450869.5325013168']
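The call above passes `metric='hellinger'`. As background, the Hellinger distance compares two non-negative vectors as if they were probability distributions. The sketch below implements the textbook definition in plain Python after normalizing each vector to sum to 1. It is for intuition only: UMAP libraries may compute the metric in a different but related form, and the function name `hellinger` is local to this example.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two non-negative vectors of equal length.

    Each vector is first normalized to sum to 1, so raw counts are accepted.
    The result lies between 0 (identical distributions) and 1 (no overlap).
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("each vector needs a positive sum")
    squared = sum(
        (math.sqrt(a / total_p) - math.sqrt(b / total_q)) ** 2
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / 2)

print(hellinger([1, 1, 2], [1, 1, 2]))  # identical distributions -> 0.0
print(hellinger([1, 0], [0, 1]))        # disjoint support -> 1.0
print(hellinger([3, 1], [1, 3]))        # partial overlap, strictly between 0 and 1
```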

Test the `engine` flag to see the speedup

[7]:

g = graphistry.nodes(df)
t = time()
g2 = g.umap(engine='cuml')  # GPU path
minutes = (time() - t) / 60  # elapsed wall-clock time in minutes
rows_per_min = df.shape[0] / minutes
print(['time: ' + str(minutes) + ' line/min: ' + str(rows_per_min)])

! Failed umap speedup attempt. Continuing without memoization speedups.
* Ignoring target column of shape (1000, 0) in UMAP fit, as it is not one dimensional

['time: 0.00446544885635376 line/min: 223941.65338544376']

[8]:

g = graphistry.nodes(df)
t = time()
# Note: this can take appreciable time, depending on the sample count defined above
g2 = g.umap(engine='umap_learn')
minutes = (time() - t) / 60  # elapsed wall-clock time in minutes
rows_per_min = df.shape[0] / minutes
print(['time: ' + str(minutes) + ' line/min: ' + str(rows_per_min)])

* Ignoring target column of shape (1000, 0) in UMAP fit, as it is not one dimensional

['time: 0.11818180878957113 line/min: 8461.539134001174']
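The two outputs above can be turned into a speedup factor with one division. The numbers below are copied from those outputs. They come from a single run each on 1000 synthetic rows, so the ratio is a rough indication rather than a benchmark.

```python
# Wall-clock minutes copied from the two outputs above
cuml_minutes = 0.00446544885635376
umap_learn_minutes = 0.11818180878957113

speedup = umap_learn_minutes / cuml_minutes
print(f"cuml was about {speedup:.1f}x faster in this single run")
```

Repeating each call several times and comparing medians would give a steadier estimate, since one-off costs such as library warm-up can dominate short runs.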

Now let's look at some real data:

[12]:

G = pd.read_csv('pygraphistry/demos/data/honeypot.csv')

g = graphistry.nodes(G)
t = time()
g3 = g.umap(engine='cuml')  # use engine='umap_learn' to compare against the CPU path
minutes = (time() - t) / 60  # elapsed wall-clock time in minutes
rows_per_min = G.shape[0] / minutes
print(['time: ' + str(minutes) + ' line/min: ' + str(rows_per_min)])

! Failed umap speedup attempt. Continuing without memoization speedups.
* Ignoring target column of shape (220, 0) in UMAP fit, as it is not one dimensional

['time: 0.008098324139912924 line/min: 27166.11439590581']

[13]:

print(g3._edges.info())
g3._edges.sample(5)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2410 entries, 0 to 2821
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   _src_implicit  2410 non-null   int32
 1   _dst_implicit  2410 non-null   int32
 2   _weight        2410 non-null   float32
dtypes: float32(1), int32(2)
memory usage: 47.1 KB
None

[13]:

	_src_implicit	_dst_implicit	_weight
671	51	123	0.017956
2123	167	194	0.663975
1761	139	78	0.113361
2444	191	3	0.999991
2441	190	152	0.544303
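The edge table produced by UMAP carries a `_weight` column, so a common next step is to keep only the stronger edges. The sketch below does this in plain Python on the five sampled rows shown above. The 0.5 threshold and the helper name `strong_edges` are arbitrary choices for illustration.

```python
# The five sampled rows shown above, as (source, destination, weight) tuples
sample_edges = [
    (51, 123, 0.017956),
    (167, 194, 0.663975),
    (139, 78, 0.113361),
    (191, 3, 0.999991),
    (190, 152, 0.544303),
]

def strong_edges(edges, min_weight):
    """Keep edges whose weight is at least min_weight, strongest first."""
    kept = [e for e in edges if e[2] >= min_weight]
    return sorted(kept, key=lambda e: e[2], reverse=True)

for src, dst, weight in strong_edges(sample_edges, min_weight=0.5):
    print(src, dst, weight)
```

On the full DataFrame, the equivalent pandas filter would be `g3._edges[g3._edges['_weight'] >= 0.5]`.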

[16]:

#g3.plot()

Next steps

Part I: CPU Baseline in Python Pandas
Part II: GPU DataFrames with RAPIDS Python cudf bindings
Part III: GPU SQL - deprecated, as Dask-SQL replaced BlazingSQL in the RAPIDS ecosystem
Part IV: GPU ML with RAPIDS cuML UMAP and PyGraphistry
Graphistry cuGraph bindings