Vaex introduction in 11 minutes#
If you want to try out this notebook with a live Python kernel, use mybinder.
DataFrame#
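The core of Vaex is the DataFrame (similar to a Pandas DataFrame, but more efficient), and we usually use the variable df to represent it. A DataFrame is an efficient representation of a large tabular dataset, and has:
- A set of columns, say x, y and z, which are:
  - backed by a Numpy array;
  - wrapped by an expression system, e.g. df.x, df['x'] or df.col.x is an Expression;
  - able to perform lazy computations, e.g. df.x * np.sin(df.y) does nothing until the result is needed.
- A set of virtual columns, which are backed by a (lazy) computation, e.g. df['r'] = df.x/df.y
- A set of selections, which can be used to explore the dataset, e.g. df.select(df.x < 0)
- Filtered DataFrames, which do not copy the data, e.g. df_negative = df[df.x < 0]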
Let's start with an example dataset, which is included in Vaex.
[1]:
import vaex
df = vaex.example()
df # Since this is the last statement in a cell, it will print the DataFrame in a nice HTML format.
[1]:
| # | id | x | y | z | vx | vy | vz | E | L | Lz | FeH |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1.2318683862686157 | -0.39692866802215576 | -0.598057746887207 | 301.1552734375 | 174.05947875976562 | 27.42754554748535 | -149431.40625 | 407.38897705078125 | 333.9555358886719 | -1.0053852796554565 |
| 1 | 23 | -0.16370061039924622 | 3.654221296310425 | -0.25490644574165344 | -195.00022888183594 | 170.47216796875 | 142.5302276611328 | -124247.953125 | 890.2411499023438 | 684.6676025390625 | -1.7086670398712158 |
| 2 | 32 | -2.120255947113037 | 3.326052665710449 | 1.7078403234481812 | -48.63423156738281 | 171.6472930908203 | -2.079437255859375 | -138500.546875 | 372.2410888671875 | -202.17617797851562 | -1.8336141109466553 |
| 3 | 8 | 4.7155890464782715 | 4.5852508544921875 | 2.2515437602996826 | -232.42083740234375 | -294.850830078125 | 62.85865020751953 | -60037.0390625 | 1297.63037109375 | -324.6875 | -1.4786882400512695 |
| 4 | 16 | 7.21718692779541 | 11.99471664428711 | -1.064562201499939 | -1.6891745328903198 | 181.329345703125 | -11.333610534667969 | -83206.84375 | 1332.7989501953125 | 1328.948974609375 | -1.8570483922958374 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 329,995 | 21 | 1.9938701391220093 | 0.789276123046875 | 0.22205990552902222 | -216.92990112304688 | 16.124420166015625 | -211.244384765625 | -146457.4375 | 457.72247314453125 | 203.36758422851562 | -1.7451677322387695 |
| 329,996 | 25 | 3.7180912494659424 | 0.721337616443634 | 1.6415337324142456 | -185.92160034179688 | -117.25082397460938 | -105.4986572265625 | -126627.109375 | 335.0025634765625 | -301.8370056152344 | -0.9822322130203247 |
| 329,997 | 14 | 0.3688507676124573 | 13.029608726501465 | -3.633934736251831 | -53.677146911621094 | -145.15771484375 | 76.70909881591797 | -84912.2578125 | 817.1375732421875 | 645.8507080078125 | -1.7645612955093384 |
| 329,998 | 18 | -0.11259264498949051 | 1.4529125690460205 | 2.168952703475952 | 179.30865478515625 | 205.79710388183594 | -68.75872802734375 | -133498.46875 | 724.000244140625 | -283.6910400390625 | -1.8808952569961548 |
| 329,999 | 4 | 20.796220779418945 | -3.331387758255005 | 12.18841552734375 | 42.69000244140625 | 69.20479583740234 | 29.54275131225586 | -65519.328125 | 1843.07470703125 | 1581.4151611328125 | -1.1231083869934082 |
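Columns#
The preview above shows that this dataset contains \(> 300,000\) rows, and columns named x, y, z (positions), vx, vy, vz (velocities), E (energy), L (angular momentum), and id (subgroup of samples). When we print out a column, we can see that it is not a Numpy array, but an Expression.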
[2]:
df.x # df.col.x and df['x'] are equivalent: df.col.x is friendlier for tab completion, df['x'] for programmatic access
[2]:
Expression = x
Length: 330,000 dtype: float32 (column)
---------------------------------------
0 1.23187
1 -0.163701
2 -2.12026
3 4.71559
4 7.21719
...
329995 1.99387
329996 3.71809
329997 0.368851
329998 -0.112593
329999 20.7962
The .values attribute gives an in-memory representation of an expression. The same can be applied to a DataFrame.
[3]:
df.x.values
[3]:
array([ 1.2318684 , -0.16370061, -2.120256 , ..., 0.36885077,
-0.11259264, 20.79622 ], dtype=float32)
Most Numpy functions (ufuncs) can be applied to expressions. They do not produce a result directly, but generate a new expression instead.
[4]:
import numpy as np
np.sqrt(df.x**2 + df.y**2 + df.z**2)
[4]:
Expression = sqrt((((x ** 2) + (y ** 2)) + (z ** 2)))
Length: 330,000 dtype: float32 (expression)
-------------------------------------------
0 1.42574
1 3.66676
2 4.29824
3 6.95203
4 14.039
...
329995 2.15587
329996 4.12785
329997 13.5319
329998 2.61304
329999 24.3339
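Virtual columns#
Sometimes it is convenient to store an expression as a column. We call this a virtual column, since it does not take up any memory and is computed on the fly when needed. A virtual column is treated just like a normal column.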
[5]:
df['r'] = np.sqrt(df.x**2 + df.y**2 + df.z**2)
df[['x', 'y', 'z', 'r']]
[5]:
| # | x | y | z | r |
|---|---|---|---|---|
| 0 | 1.2318683862686157 | -0.39692866802215576 | -0.598057746887207 | 1.425736665725708 |
| 1 | -0.16370061039924622 | 3.654221296310425 | -0.25490644574165344 | 3.666757345199585 |
| 2 | -2.120255947113037 | 3.326052665710449 | 1.7078403234481812 | 4.298235893249512 |
| 3 | 4.7155890464782715 | 4.5852508544921875 | 2.2515437602996826 | 6.952032566070557 |
| 4 | 7.21718692779541 | 11.99471664428711 | -1.064562201499939 | 14.03902816772461 |
| ... | ... | ... | ... | ... |
| 329,995 | 1.9938701391220093 | 0.789276123046875 | 0.22205990552902222 | 2.155872344970703 |
| 329,996 | 3.7180912494659424 | 0.721337616443634 | 1.6415337324142456 | 4.127851963043213 |
| 329,997 | 0.3688507676124573 | 13.029608726501465 | -3.633934736251831 | 13.531896591186523 |
| 329,998 | -0.11259264498949051 | 1.4529125690460205 | 2.168952703475952 | 2.613041877746582 |
| 329,999 | 20.796220779418945 | -3.331387758255005 | 12.18841552734375 | 24.333894729614258 |
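To make the idea of "computed on the fly" concrete, here is a toy sketch in plain Python. The `LazyColumn` class is hypothetical and is not how Vaex is implemented; it only illustrates that defining a virtual column stores a recipe, and that values are produced when they are requested. The two rows are copied from the preview above.

```python
import math

class LazyColumn:
    """Toy stand-in for a virtual column: stores a recipe, not the values."""

    def __init__(self, label, compute):
        self.label = label          # human-readable formula
        self.compute = compute      # function mapping one row to one value
        self.evaluations = 0        # number of rows actually computed so far

    def evaluate(self, rows):
        result = []
        for row in rows:
            self.evaluations += 1
            result.append(self.compute(row))
        return result

# First two rows of the example dataset shown above.
rows = [
    {"x": 1.2318684, "y": -0.39692867, "z": -0.59805775},
    {"x": -0.16370061, "y": 3.6542213, "z": -0.25490645},
]

r = LazyColumn(
    "sqrt(x**2 + y**2 + z**2)",
    lambda row: math.sqrt(row["x"] ** 2 + row["y"] ** 2 + row["z"] ** 2),
)

evaluations_before = r.evaluations   # defining the column computed nothing
r_values = r.evaluate(rows)          # values appear only on request
print(r.label, r_values)
```

The values agree with the r column in the table above to the displayed precision.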
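Selections and filtering#
Vaex can be very efficient when exploring subsets of the data, for instance to remove outliers or to inspect only part of the data. Instead of making copies, Vaex internally keeps track of which rows are selected.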
[6]:
df.select(df.x < 0)
df.evaluate(df.x, selection=True)
[6]:
array([-0.16370061, -2.120256 , -7.7843747 , ..., -8.126636 ,
-3.9477386 , -0.11259264], dtype=float32)
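Selections are useful when you frequently change the portion of the data you want to visualize, or when you want to efficiently compute statistics on several portions of the data.
Alternatively, you can create a filtered DataFrame. This is similar to using Pandas, except that Vaex does not copy the data.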
[7]:
df_negative = df[df.x < 0]
df_negative[['x', 'y', 'z', 'r']]
[7]:
| # | x | y | z | r |
|---|---|---|---|---|
| 0 | -0.16370061039924622 | 3.654221296310425 | -0.25490644574165344 | 3.666757345199585 |
| 1 | -2.120255947113037 | 3.326052665710449 | 1.7078403234481812 | 4.298235893249512 |
| 2 | -7.784374713897705 | 5.989774703979492 | -0.682695209980011 | 9.845809936523438 |
| 3 | -3.5571861267089844 | 5.413629055023193 | 0.09171556681394577 | 6.478376865386963 |
| 4 | -20.813940048217773 | -3.294677495956421 | 13.486607551574707 | 25.019264221191406 |
| ... | ... | ... | ... | ... |
| 166,274 | -2.5926425457000732 | -2.871671676635742 | -0.18048334121704102 | 3.8730955123901367 |
| 166,275 | -0.7566012144088745 | 2.9830434322357178 | -6.940553188323975 | 7.592250823974609 |
| 166,276 | -8.126635551452637 | 1.1619765758514404 | -1.6459038257598877 | 8.372657775878906 |
| 166,277 | -3.9477386474609375 | -3.0684902667999268 | -1.5822702646255493 | 5.244411468505859 |
| 166,278 | -0.11259264498949051 | 1.4529125690460205 | 2.168952703475952 | 2.613041877746582 |
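Statistics on N-d grids#
A core feature of Vaex is the very efficient calculation of statistics on N-dimensional grids. This is especially useful for visualizing large datasets.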
[8]:
df.count(), df.mean(df.x), df.mean(df.x, selection=True)
[8]:
(array(330000), array(-0.0632868), array(-5.18457762))
Similar to SQL's groupby, Vaex uses the binby concept, which tells Vaex that a statistic should be calculated on a regular grid (for performance reasons).
[9]:
counts_x = df.count(binby=df.x, limits=[-10, 10], shape=64)
counts_x
[9]:
array([1374, 1350, 1459, 1618, 1706, 1762, 1852, 2007, 2240, 2340, 2610,
2840, 3126, 3337, 3570, 3812, 4216, 4434, 4730, 4975, 5332, 5800,
6162, 6540, 6805, 7261, 7478, 7642, 7839, 8336, 8736, 8279, 8269,
8824, 8217, 7978, 7541, 7383, 7116, 6836, 6447, 6220, 5864, 5408,
4881, 4681, 4337, 4015, 3799, 3531, 3320, 3040, 2866, 2629, 2488,
2244, 1981, 1905, 1734, 1540, 1437, 1378, 1233, 1186])
This results in a Numpy array with the counts in 64 bins distributed between x = -10 and x = 10. We can quickly visualize this using Matplotlib.
[10]:
import matplotlib.pyplot as plt
plt.plot(np.linspace(-10, 10, 64), counts_x)
plt.show()
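For readers who know NumPy better than Vaex, the binned count is conceptually close to a fixed-range histogram. The sketch below uses synthetic data as a stand-in for df.x (an assumption, so the numbers differ from the output above), and NumPy's handling of the right-most edge may differ slightly from Vaex's.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(0.0, 5.0, size=10_000)   # synthetic stand-in for df.x

limits, shape = (-10, 10), 64

# Count how many values fall in each of 64 equal-width bins on [-10, 10].
counts, edges = np.histogram(x, bins=shape, range=limits)

# Values outside the limits are simply not counted.
in_range = np.count_nonzero((x >= limits[0]) & (x <= limits[1]))
print(counts.sum(), in_range)
```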
We can do the same in 2-D (this can actually be generalized to N-D!), and display it with Matplotlib.
[11]:
xycounts = df.count(binby=[df.x, df.y], limits=[[-10, 10], [-10, 20]], shape=(64, 128))
xycounts
[11]:
array([[ 5, 2, 3, ..., 3, 3, 0],
[ 8, 4, 2, ..., 5, 3, 2],
[ 5, 11, 7, ..., 3, 3, 1],
...,
[ 4, 8, 5, ..., 2, 0, 2],
[10, 6, 7, ..., 1, 1, 2],
[ 6, 7, 9, ..., 2, 2, 2]])
[12]:
plt.imshow(xycounts.T, origin='lower', extent=[-10, 10, -10, 20])
plt.show()
[13]:
v = np.sqrt(df.vx**2 + df.vy**2 + df.vz**2)
xy_mean_v = df.mean(v, binby=[df.x, df.y], limits=[[-10, 10], [-10, 20]], shape=(64, 128))
xy_mean_v
[13]:
array([[156.15283203, 226.0004425 , 206.95940653, ..., 90.0340627 ,
152.08784485, nan],
[203.81366634, 133.01436043, 146.95962524, ..., 137.54756927,
98.68717448, 141.06020737],
[150.59178772, 188.38820371, 137.46753802, ..., 155.96900177,
148.91660563, 138.48191833],
...,
[168.93819809, 187.75943136, 137.318647 , ..., 144.83927917,
nan, 107.7273407 ],
[154.80492783, 140.55182203, 180.30700166, ..., 184.01670837,
95.10913086, 131.18122864],
[166.06868235, 150.54079764, 125.84606828, ..., 130.56007385,
121.04217911, 113.34659195]])
[14]:
plt.imshow(xy_mean_v.T, origin='lower', extent=[-10, 10, -10, 20])
plt.show()
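The nan entries in the array above are bins that contain no rows. A NumPy-only sketch of a mean on a 2-D grid makes this visible: the mean is the per-bin sum divided by the per-bin count, so empty bins give 0/0. The data here are synthetic (an assumption), and this is not Vaex's internal algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=2_000)    # synthetic positions
y = rng.uniform(-10, 20, size=2_000)
v = rng.uniform(50, 300, size=2_000)    # synthetic speeds

bins = (64, 128)
extent = [[-10, 10], [-10, 20]]

# Per-bin row counts and per-bin sums of v on the same grid.
counts, _, _ = np.histogram2d(x, y, bins=bins, range=extent)
sums, _, _ = np.histogram2d(x, y, bins=bins, range=extent, weights=v)

# Mean per bin; empty bins become nan (0 / 0).
with np.errstate(invalid="ignore", divide="ignore"):
    mean_v = sums / counts

print(mean_v.shape, int(np.isnan(mean_v).sum()), "empty bins")
```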
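Other statistics can be computed in the same way; see the API documentation for the full list.
Getting your data in#
Before continuing with this tutorial, you may want to read in your own data. Ultimately, a Vaex DataFrame just wraps a set of Numpy arrays. If you can access your data as a set of Numpy arrays, you can easily construct a DataFrame using from_arrays.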
[15]:
import vaex
import numpy as np
x = np.arange(5)
y = x**2
df = vaex.from_arrays(x=x, y=y)
df
[15]:
| # | x | y |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 1 |
| 2 | 2 | 4 |
| 3 | 3 | 9 |
| 4 | 4 | 16 |
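Other quick ways to get your data in are:
from_arrow_table: Arrow table support
from_csv: comma-separated files
from_ascii: space/tab-separated files
from_pandas: converts a pandas DataFrame
from_astropy_table: converts an astropy table
Exporting a DataFrame, or converting it to a different data structure, is also quite easy.
Nowadays it is common to put data, especially larger datasets, in the cloud. Vaex can read data straight from S3 in a lazy manner, meaning that only the data that is needed will be downloaded and cached on disk.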
[16]:
# Read in the NYC Taxi dataset straight from S3
nyctaxi = vaex.open('s3://vaex/taxi/nyc_taxi_2015_mini.hdf5?anon=true')
nyctaxi.head(5)
[16]:
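| # | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | payment_type | trip_distance | pickup_longitude | pickup_latitude | rate_code | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | fare_amount | surcharge | mta_tax | tip_amount | tolls_amount | total_amount |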
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | VTS | 2009-01-04 02:52:00.000000000 | 2009-01-04 03:02:00.000000000 | 1 | CASH | 2.63 | -73.992 | 40.7216 | nan | nan | -73.9938 | 40.6959 | 8.9 | 0.5 | nan | 0 | 0 | 9.4 |
| 1 | VTS | 2009-01-04 03:31:00.000000000 | 2009-01-04 03:38:00.000000000 | 3 | Credit | 4.55 | -73.9821 | 40.7363 | nan | nan | -73.9558 | 40.768 | 12.1 | 0.5 | nan | 2 | 0 | 14.6 |
| 2 | VTS | 2009-01-03 15:43:00.000000000 | 2009-01-03 15:57:00.000000000 | 5 | Credit | 10.35 | -74.0026 | 40.7397 | nan | nan | -73.87 | 40.7702 | 23.7 | 0 | nan | 4.74 | 0 | 28.44 |
| 3 | DDS | 2009-01-01 20:52:58.000000000 | 2009-01-01 21:14:00.000000000 | 1 | CREDIT | 5 | -73.9743 | 40.791 | nan | nan | -73.9966 | 40.7318 | 14.9 | 0.5 | nan | 3.05 | 0 | 18.45 |
| 4 | DDS | 2009-01-24 16:18:23.000000000 | 2009-01-24 16:24:56.000000000 | 1 | CASH | 0.4 | -74.0016 | 40.7194 | nan | nan | -74.0084 | 40.7203 | 3.7 | 0 | nan | 0 | 0 | 3.7 |
Plotting#
1-D and 2-D#
Most visualizations are done in 1 or 2 dimensions, and Vaex nicely wraps Matplotlib to cover a wide range of common use cases.
[17]:
import vaex
import numpy as np
df = vaex.example()
The simplest visualization is a 1-D plot using DataFrame.viz.histogram. Here, we only show 99.7% of the data.
[1]:
df.viz.histogram(df.x, limits='99.7%')
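A slightly more complicated visualization is to plot not the counts, but a different statistic for each bin. In most cases, passing the what='<statistic>(<expression>)' argument will do, where <statistic> is any of the statistics mentioned above or listed in the API docs.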
[19]:
df.viz.histogram(df.x, what='mean(E)', limits='99.7%');
An equivalent method is to use the vaex.stat.<statistic> functions, e.g. vaex.stat.mean.
[20]:
df.viz.histogram(df.x, what=vaex.stat.mean(df.E), limits='99.7%');
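The vaex.stat.<statistic> objects are very similar to Vaex expressions, in that they represent an underlying calculation. Typical arithmetic and Numpy functions can be applied to these calculations. However, these objects compute a single statistic, and do not return a column or expression.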
[21]:
np.log(vaex.stat.mean(df.x)/vaex.stat.std(df.x))
[21]:
log((mean(x) / std(x)))
These statistical objects can be passed to the what argument. The advantage is that the data only needs to be passed over once.
[22]:
df.viz.histogram(df.x, what=np.clip(np.log(-vaex.stat.mean(df.E)), 11, 11.4), limits='99.7%');
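A similar result can be obtained by calculating the statistic ourselves and passing it to the grid argument of DataFrame.viz.histogram. Note that the limits used for calculating the statistic and for the plot must be the same, otherwise the x axis may not correspond to the real data.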
[3]:
limits = [-30, 30]
shape = 64
meanE = df.mean(df.E, binby=df.x, limits=limits, shape=shape)
grid = np.clip(np.log(-meanE), 11, 11.4)
df.viz.histogram(df.x, grid=grid, limits=limits, ylabel='clipped E');
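In addition to plotting densities (histograms) in 1-D, we can also plot them in 2-D. This is done with the DataFrame.viz.heatmap function. It shares many arguments with the histogram and is very similar to it.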
[24]:
df.viz.heatmap(df.x, df.y, what=vaex.stat.mean(df.E)**2, limits='99.7%');
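Selections for plotting#
While filtering is useful for narrowing down the contents of a DataFrame (e.g. df_negative = df[df.x < 0]), it has a few downsides. First, a practical issue is that when you filter in 4 different ways, you will need 4 different DataFrames polluting your namespace. More importantly, when Vaex executes a bunch of statistical computations, it will do so per DataFrame, meaning that 4 passes over the data will be made, even though all 4 DataFrames point to the same underlying data.
If instead we have 4 (named) selections in our DataFrame, we can calculate the statistics in a single pass over the data, which can be significantly faster, especially when the dataset is larger than memory.
In the plot below, we show three selections, which by default are blended together, requiring just one pass over the data.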
[25]:
df.viz.heatmap(df.x, df.y, what=np.log(vaex.stat.count()+1), limits='99.7%',
selection=[None, df.x < df.y, df.x < -10]);
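The single-pass argument can be sketched with NumPy: each chunk of data is read once, and every selection is evaluated on that chunk before moving on. The data, the chunk size and the selection names below are illustrative assumptions, not Vaex internals.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1_000)   # synthetic stand-ins for df.x and df.y
y = rng.normal(size=1_000)

# Three selections, each a function returning a boolean mask for a chunk.
selections = {
    "all": lambda cx, cy: np.ones_like(cx, dtype=bool),
    "x < y": lambda cx, cy: cx < cy,
    "x < -1": lambda cx, cy: cx < -1,
}

counts = {name: 0 for name in selections}
chunks_read = 0
chunk_size = 250

# One pass over the data: every chunk is read once and shared by all selections.
for start in range(0, len(x), chunk_size):
    cx = x[start:start + chunk_size]
    cy = y[start:start + chunk_size]
    chunks_read += 1
    for name, mask in selections.items():
        counts[name] += int(np.count_nonzero(mask(cx, cy)))

print(chunks_read, counts)
```

With separate filtered DataFrames, the same statistics would require reading the chunks once per filter.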
Advanced plotting#
Let's say we would like to see two plots next to each other. To achieve this, we can pass a list of expression pairs.
[26]:
df.viz.heatmap([["x", "y"], ["x", "z"]], limits='99.7%',
title="Face on and edge on", figsize=(10,4));
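By default, if you have multiple plots, they are shown as columns, multiple selections are overplotted, and multiple 'whats' (statistics) are shown as rows.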
[27]:
df.viz.heatmap([["x", "y"], ["x", "z"]],
limits='99.7%',
what=[np.log(vaex.stat.count()+1), vaex.stat.mean(df.E)],
selection=[None, df.x < df.y],
title="Face on and edge on", figsize=(10,10));
Note that the selection has no effect in the bottom row.
However, this behaviour can be changed using the visual argument.
[28]:
df.viz.heatmap([["x", "y"], ["x", "z"]],
limits='99.7%',
what=vaex.stat.mean(df.E),
selection=[None, df.Lz < 0],
visual=dict(column='selection'),
title="Face on and edge on", figsize=(10,10));
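Slices in a 3rd dimension#
If a third axis (z) is given, you can 'slice' through the data, displaying the z slices as rows. Note that the rows here are wrapped, which can be changed using the wrap_columns argument.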
[29]:
df.viz.heatmap("Lz", "E",
limits='99.7%',
z="FeH:-2.5,-1,8", show=True, visual=dict(row="z"),
figsize=(12,8), f="log", wrap_columns=3);
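Visualization of smaller datasets#
Although Vaex is focused on large datasets, sometimes you end up with a small fraction of the data (e.g. due to a selection) and you want to make a scatter plot. You can do so as follows: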
[30]:
import vaex
df = vaex.example()
[31]:
import matplotlib.pyplot as plt
x = df.evaluate("x", selection=df.Lz < -2500)
y = df.evaluate("y", selection=df.Lz < -2500)
plt.scatter(x, y, c="red", alpha=0.5, s=4);
[32]:
df.viz.scatter(df.x, df.y, selection=df.Lz < -2500, c="red", alpha=0.5, s=4)
df.viz.scatter(df.x, df.y, selection=df.Lz > 1500, c="green", alpha=0.5, s=4);
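In control#
While Vaex provides a wrapper for Matplotlib, there are situations where you want to use the DataFrame.viz methods but still be in control of the plot. Vaex simply uses the current figure and axes objects, so this is easy to do.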
[33]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14,7))
plt.sca(ax1)
selection = df.Lz < -2500
x = df[selection].x.evaluate()
y = df[selection].y.evaluate()
df.viz.heatmap(df.x, df.y)
plt.scatter(x, y)
plt.xlabel(r'my own label $\gamma$')
plt.xlim(-20, 20)
plt.ylim(-20, 20)
plt.sca(ax2)
df.viz.histogram(df.x, label='counts', n=True)
x = np.linspace(-30, 30, 100)
std = df.std(df.x.expression)
y = np.exp(-(x**2/std**2/2)) / np.sqrt(2*np.pi) / std
plt.plot(x, y, label='gaussian fit')
plt.legend()
plt.show()
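Healpix (Plotting)#
Healpix plotting is supported via the healpy package. Vaex does not need special support for healpix, only for plotting, but some helper functions are introduced to make working with healpix easier.
In the following example we will use the TGAS astronomical dataset.
To understand healpix better, we will start from the beginning. If we want to make a density sky plot, we would like to pass healpy a 1-D Numpy array where each value represents the density at a location on the sphere, with the location determined by the array size (the healpix level) and the offset (the position). The TGAS (and Gaia) data include the healpix index encoded in the source_id. Dividing the source_id by 34359738368 gives a healpix index at level 12, and dividing it further takes you to lower levels.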
[34]:
import vaex
import healpy as hp
tgas = vaex.datasets.tgas(full=True)
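The index arithmetic described above can be checked in plain Python. This sketch assumes the nested healpix scheme, in which going down one level divides the index by 4, and it uses integer division as a simplification. The helper names `healpix_index` and `npix` are hypothetical; the two source_id values are taken from the first and one of the last rows of the TGAS table printed at the end of this tutorial.

```python
SOURCE_ID_FACTOR = 34359738368   # 2**35, the divisor quoted in the text

def healpix_index(source_id, level):
    """Nested healpix index at `level` (0..12), using integer division."""
    return source_id // (SOURCE_ID_FACTOR * 4 ** (12 - level))

def npix(level):
    """Number of healpix pixels at `level`: 12 base pixels, each split in 4 per level."""
    return 12 * 4 ** level

first_id = 7627862074752
late_id = 6917488998546378368

print(healpix_index(first_id, 12))   # fine-grained index at level 12
print(healpix_index(first_id, 2))    # coarse index at level 2
print(healpix_index(late_id, 2), "of", npix(2), "pixels")
```

At level 2 there are 192 pixels, so valid indices run from 0 to 191, which is the range covered by the two example rows.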
We will start by showing how you could manually compute statistics on healpix bins using the count method. We will use a very coarse healpix scheme (level 2).
[35]:
level = 2
factor = 34359738368 * (4**(12-level))
nmax = hp.nside2npix(2**level)
epsilon = 1e-16
counts = tgas.count(binby=tgas.source_id/factor, limits=[-epsilon, nmax-epsilon], shape=nmax)
counts
[35]:
array([ 4021, 6171, 5318, 7114, 5755, 13420, 12711, 10193, 7782,
14187, 12578, 22038, 17313, 13064, 17298, 11887, 3859, 3488,
9036, 5533, 4007, 3899, 4884, 5664, 10741, 7678, 12092,
10182, 6652, 6793, 10117, 9614, 3727, 5849, 4028, 5505,
8462, 10059, 6581, 8282, 4757, 5116, 4578, 5452, 6023,
8340, 6440, 8623, 7308, 6197, 21271, 23176, 12975, 17138,
26783, 30575, 31931, 29697, 17986, 16987, 19802, 15632, 14273,
10594, 4807, 4551, 4028, 4357, 4067, 4206, 3505, 4137,
3311, 3582, 3586, 4218, 4529, 4360, 6767, 7579, 14462,
24291, 10638, 11250, 29619, 9678, 23322, 18205, 7625, 9891,
5423, 5808, 14438, 17251, 7833, 15226, 7123, 3708, 6135,
4110, 3587, 3222, 3074, 3941, 3846, 3402, 3564, 3425,
4125, 4026, 3689, 4084, 16617, 13577, 6911, 4837, 13553,
10074, 9534, 20824, 4976, 6707, 5396, 8366, 13494, 19766,
11012, 16130, 8521, 8245, 6871, 5977, 8789, 10016, 6517,
8019, 6122, 5465, 5414, 4934, 5788, 6139, 4310, 4144,
11437, 30731, 13741, 27285, 40227, 16320, 23039, 10812, 14686,
27690, 15155, 32701, 18780, 5895, 23348, 6081, 17050, 28498,
35232, 26223, 22341, 15867, 17688, 8580, 24895, 13027, 11223,
7880, 8386, 6988, 5815, 4717, 9088, 8283, 12059, 9161,
6952, 4914, 6652, 4666, 12014, 10703, 16518, 10270, 6724,
4553, 9282, 4981])
Using healpy's mollview, we can visualize this.
[36]:
hp.mollview(counts, nest=True)
To simplify life, Vaex includes DataFrame.healpix_count to take care of this.
[37]:
counts = tgas.healpix_count(healpix_level=6)
hp.mollview(counts, nest=True)
Or even simpler, use DataFrame.viz.healpix_heatmap:
[38]:
tgas.viz.healpix_heatmap(
f="log1p",
healpix_level=6,
figsize=(10,8),
healpix_output="ecliptic"
)
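xarray support#
The df.count method can also return an xarray data array instead of a numpy array. This is easily done via the array_type keyword. Building on top of numpy, xarray adds dimension labels, coordinates and attributes, which makes working with multi-dimensional arrays more convenient.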
[39]:
xarr = df.count(binby=[df.x, df.y], limits=[-10, 10], shape=64, array_type='xarray')
xarr
[39]:
Dimensions: (x: 64, y: 64)
array([[ 6, 3, 7, ..., 15, 10, 11],
[10, 3, 7, ..., 10, 13, 11],
[ 5, 15, 5, ..., 12, 18, 12],
...,
[ 7, 8, 10, ..., 6, 7, 7],
[12, 10, 17, ..., 11, 8, 2],
[ 7, 10, 13, ..., 6, 5, 7]])
Coordinates:
x (x) float64 -9.844 -9.531 ... 9.531 9.844
y (y) float64 -9.844 -9.531 ... 9.531 9.844
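The coordinate values in the output above look like bin centers for limits [-10, 10] and shape 64. That is an inference from the numbers, not a statement from the Vaex documentation; the small sketch below, with a hypothetical helper `bin_centers`, reproduces them.

```python
def bin_centers(vmin, vmax, shape):
    """Centers of `shape` equal-width bins spanning [vmin, vmax]."""
    width = (vmax - vmin) / shape
    return [vmin + (i + 0.5) * width for i in range(shape)]

centers = bin_centers(-10, 10, 64)
print(centers[0], centers[1], centers[-1])
```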
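In addition, xarray has a very convenient plotting method. Since the xarray object carries information about the labels of each dimension, the plot axes will be labeled automatically.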
[40]:
xarr.plot();
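Having xarray as output helps us explore the contents of our data faster. In the following example we show how easy it is to plot the 2-D distribution of the sample positions (x, y) for each id group.
Notice how xarray automatically adds the appropriate titles and axis labels to the figure.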
[41]:
df.categorize('id', inplace=True) # treat the id as a categorical column - automatically adjusts limits and shape
xarr = df.count(binby=['x', 'y', 'id'], limits='95%', array_type='xarray')
np.log1p(xarr).plot(col='id', col_wrap=7);
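Interactive widgets#
Note: The interactive widgets require a running Python kernel. If you are viewing this documentation online, you can get a feeling for what the widgets can do, but computation will not be possible!
Using the vaex-jupyter package, we get access to interactive widgets (see the Vaex Jupyter tutorial for a more in-depth treatment).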
[42]:
import vaex
import vaex.jupyter
import numpy as np
import matplotlib.pyplot as plt
df = vaex.example()
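The simplest way to get a more interactive visualization (or even print out statistics) is to use the vaex.jupyter.interactive_selection decorator, which will execute the decorated function each time the selection is changed.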
[43]:
df.select(df.x > 0)
@vaex.jupyter.interactive_selection(df)
def plot(*args, **kwargs):
print("Mean x for the selection is:", df.mean(df.x, selection=True))
df.viz.heatmap(df.x, df.y, what=np.log(vaex.stat.count()+1), selection=[None, True], limits='99.7%')
plt.show()
After changing the selection programmatically, the visualization will update, as well as the print output.
[44]:
df.select(df.x > df.y)
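However, to get truly interactive visualization, we need to use widgets, such as the bqplot library. Again, if we make a selection here, the above visualization will also update, so let's select a square region.
See more interactive widgets in the Vaex Jupyter tutorial.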
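Joining#
Joining in Vaex is similar to Pandas, except that the data will not be copied. Internally, an index array is kept for each row of the left DataFrame, pointing to the right DataFrame, which requires about 8GB for a dataset of a billion (\(10^9\)) rows. Let's start with two small DataFrames, df1 and df2: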
[47]:
a = np.array(['a', 'b', 'c'])
x = np.arange(1,4)
df1 = vaex.from_arrays(a=a, x=x)
df1
[47]:
| # | a | x |
|---|---|---|
| 0 | a | 1 |
| 1 | b | 2 |
| 2 | c | 3 |
[48]:
b = np.array(['a', 'b', 'd'])
y = x**2
df2 = vaex.from_arrays(b=b, y=y)
df2
[48]:
| # | b | y |
|---|---|---|
| 0 | a | 1 |
| 1 | b | 4 |
| 2 | d | 9 |
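The default join is a 'left' join, where all rows of the left DataFrame (df1) are kept, and matching rows of the right DataFrame (df2) are added. We see that for the columns b and y, some values are missing, as expected.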
[49]:
df1.join(df2, left_on='a', right_on='b')
[49]:
| # | a | x | b | y |
|---|---|---|---|---|
| 0 | a | 1 | a | 1 |
| 1 | b | 2 | b | 4 |
| 2 | c | 3 | -- | -- |
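A 'right' join is basically the same, but now the roles of the left and right DataFrames are swapped, so now some values from the columns a and x are missing.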
[50]:
df1.join(df2, left_on='a', right_on='b', how='right')
[50]:
| # | b | y | a | x |
|---|---|---|---|---|
| 0 | a | 1 | a | 1 |
| 1 | b | 4 | b | 2 |
| 2 | d | 9 | -- | -- |
We can also do an 'inner' join, in which the output DataFrame contains only the rows common to df1 and df2.
[51]:
df1.join(df2, left_on='a', right_on='b', how='inner')
[51]:
| # | a | x | b | y |
|---|---|---|---|---|
| 0 | a | 1 | a | 1 |
| 1 | b | 2 | b | 4 |
Other joins (e.g. outer) are currently not supported. Feel free to open an issue on GitHub for this.
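The index array mentioned at the start of this section can be illustrated with a toy left join in plain Python. This is a conceptual sketch, not Vaex's implementation: it assumes unique keys on the right side and uses -1 to mark "no match". The memory estimate assumes one 64-bit (8-byte) index per left row.

```python
left_keys = ['a', 'b', 'c']     # df1.a
right_keys = ['a', 'b', 'd']    # df2.b
right_y = [1, 4, 9]             # df2.y

# Map each right key to its row number, then look up every left key.
lookup = {key: row for row, key in enumerate(right_keys)}
index = [lookup.get(key, -1) for key in left_keys]

# The right-hand column is read through the index; nothing is copied up front.
joined_y = [right_y[i] if i >= 0 else None for i in index]
print(index, joined_y)

# One 8-byte index per left row, for a billion rows.
bytes_for_billion_rows = 8 * 10**9
print(bytes_for_billion_rows / 10**9, "GB")
```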
Group-by#
With Vaex one can also do fast group-by aggregations. The output is a Vaex DataFrame. Let us look at a few examples.
[52]:
import vaex
animal = ['dog', 'dog', 'cat', 'guinea pig', 'guinea pig', 'dog']
age = [2, 1, 5, 1, 3, 7]
cuteness = [9, 10, 5, 8, 4, 8]
df_pets = vaex.from_arrays(animal=animal, age=age, cuteness=cuteness)
df_pets
[52]:
| # | animal | age | cuteness |
|---|---|---|---|
| 0 | dog | 2 | 9 |
| 1 | dog | 1 | 10 |
| 2 | cat | 5 | 5 |
| 3 | guinea pig | 1 | 8 |
| 4 | guinea pig | 3 | 4 |
| 5 | dog | 7 | 8 |
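The syntax for group-by operations is virtually identical to that of Pandas. Note that when multiple aggregations are passed for a single column or expression, the output columns are named accordingly.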
[53]:
df_pets.groupby(by='animal').agg({'age': 'mean',
'cuteness': ['mean', 'std']})
[53]:
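| # | animal | age | cuteness_mean | cuteness_std |
|---|---|---|---|---|
| 0 | dog | 3.33333 | 9 | 0.816497 |
| 1 | cat | 5 | 5 | 0 |
| 2 | guinea pig | 2 | 6 | 2 |
Vaex supports a number of aggregation functions:
vaex.agg.count: number of elements in a group
vaex.agg.first: the first element in a group
vaex.agg.max: the largest value in a group
vaex.agg.min: the smallest value in a group
vaex.agg.sum: the sum of a group
vaex.agg.mean: the mean value of a group
vaex.agg.std: the standard deviation of a group
vaex.agg.var: the variance of a group
vaex.agg.nunique: number of unique elements in a group
In addition, we can specify the aggregation operations inside the groupby method. We can also name the resulting aggregate columns as we wish.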
[54]:
df_pets.groupby(by='animal',
agg={'mean_age': vaex.agg.mean('age'),
'cuteness_unique_values': vaex.agg.nunique('cuteness'),
'cuteness_unique_min': vaex.agg.min('cuteness')})
[54]:
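| # | animal | mean_age | cuteness_unique_values | cuteness_unique_min |
|---|---|---|---|---|
| 0 | dog | 3.33333 | 3 | 8 |
| 1 | cat | 5 | 1 | 5 |
| 2 | guinea pig | 2 | 2 | 4 |
A powerful feature of the aggregation functions in Vaex is that they support selections. This gives us the flexibility to make selections while aggregating. For example, let's calculate the mean cuteness of the pets in this example DataFrame, but separated by age.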
[55]:
df_pets.groupby(by='animal',
agg={'mean_cuteness_old': vaex.agg.mean('cuteness', selection='age>=5'),
'mean_cuteness_young': vaex.agg.mean('cuteness', selection='~(age>=5)')})
[55]:
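| # | animal | mean_cuteness_old | mean_cuteness_young |
|---|---|---|---|
| 0 | dog | 8 | 9.5 |
| 1 | cat | 5 | nan |
| 2 | guinea pig | nan | 6 |
Note that in the last example, the grouped DataFrame contains NaNs for the groups in which there are no samples.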
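The numbers in the last table can be reproduced by hand. The following plain-Python sketch is only a cross-check of the logic, not how Vaex computes it; `grouped_mean` is a hypothetical helper. It applies the selection per row while grouping, and reports NaN for groups where the selection leaves no rows.

```python
from statistics import mean

animal = ['dog', 'dog', 'cat', 'guinea pig', 'guinea pig', 'dog']
age = [2, 1, 5, 1, 3, 7]
cuteness = [9, 10, 5, 8, 4, 8]

def grouped_mean(selected):
    """Mean cuteness per animal, using only rows whose age passes `selected`."""
    groups = {}
    for name, years, cute in zip(animal, age, cuteness):
        values = groups.setdefault(name, [])   # every group appears in the output
        if selected(years):
            values.append(cute)
    return {name: (mean(values) if values else float('nan'))
            for name, values in groups.items()}

mean_cuteness_old = grouped_mean(lambda years: years >= 5)
mean_cuteness_young = grouped_mean(lambda years: not years >= 5)
print(mean_cuteness_old)
print(mean_cuteness_young)
```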
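String processing#
String processing is similar to Pandas, except that all operations are performed lazily, multithreaded, and faster (in C++). Check the API docs for more examples.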
[56]:
import vaex
text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
df = vaex.from_arrays(text=text)
df
[56]:
| # | text |
|---|---|
| 0 | Something |
| 1 | very pretty |
| 2 | is coming |
| 3 | our |
| 4 | way. |
[57]:
df.text.str.upper()
[57]:
Expression = str_upper(text)
Length: 5 dtype: str (expression)
---------------------------------
0 SOMETHING
1 VERY PRETTY
2 IS COMING
3 OUR
4 WAY.
[58]:
df.text.str.title().str.replace('et', 'ET')
[58]:
Expression = str_replace(str_title(text), 'et', 'ET')
Length: 5 dtype: str (expression)
---------------------------------
0 SomEThing
1 Very PrETty
2 Is Coming
3 Our
4 Way.
[59]:
df.text.str.contains('e')
[59]:
Expression = str_contains(text, 'e')
Length: 5 dtype: bool (expression)
----------------------------------
0 True
1 True
2 False
3 False
4 False
[60]:
df.text.str.count('e')
[60]:
Expression = str_count(text, 'e')
Length: 5 dtype: int64 (expression)
-----------------------------------
0 1
1 2
2 0
3 0
4 0
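For comparison, the same four operations can be written eagerly with Python's built-in string methods. This is only a cross-check of the outputs shown above on this particular input; it says nothing about how Vaex evaluates them, and pattern handling (for example regular expressions in contains) may differ in general.

```python
text = ['Something', 'very pretty', 'is coming', 'our', 'way.']

upper = [s.upper() for s in text]
title_replaced = [s.title().replace('et', 'ET') for s in text]
contains_e = ['e' in s for s in text]
count_e = [s.count('e') for s in text]

print(upper)
print(title_replaced)
print(contains_e)
print(count_e)
```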
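Propagation of uncertainties#
In science one often deals with measurement uncertainties (sometimes referred to as measurement errors). When transformations are made with quantities that have uncertainties associated with them, Vaex can automatically calculate the uncertainties of the transformed quantities. Note that propagation of uncertainties requires derivatives and matrix multiplications of lengthy equations, which is not complex, but tedious. Vaex can automatically calculate all dependencies and derivatives, and compute the full covariance matrix.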
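To make the remark about derivatives and matrix multiplications concrete, here is a NumPy sketch of the textbook first-order rule, in which the output covariance is J Σ Jᵀ for a Jacobian J and input covariance Σ. It is not Vaex's API; the helper name `propagate` is hypothetical. The example applies it to distance = 1/parallax, using the parallax and parallax_error of the first row of the TGAS table printed further on.

```python
import numpy as np

def propagate(jacobian, covariance):
    """First-order uncertainty propagation: cov_out = J @ cov_in @ J.T"""
    J = np.atleast_2d(np.asarray(jacobian, dtype=float))
    cov = np.atleast_2d(np.asarray(covariance, dtype=float))
    return J @ cov @ J.T

parallax = 6.35295075173405
parallax_error = 0.3079103606852086

distance = 1.0 / parallax

# Jacobian of distance with respect to parallax: d(1/p)/dp = -1/p**2
J = [[-1.0 / parallax**2]]
cov_distance = propagate(J, [[parallax_error**2]])
distance_error = float(np.sqrt(cov_distance[0, 0]))

print(distance, distance_error)
```

For a single input this reduces to the familiar rule that the error on 1/p is the error on p divided by p squared; with several correlated inputs, the off-diagonal terms of the covariance matrix enter in the same product.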
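As an example, let us use the TGAS astronomical dataset once again. Even though the TGAS dataset already contains galactic sky coordinates (l and b), let's add them again by performing a coordinate system rotation from right ascension and declination. We can apply a similar transformation and convert from spherical galactic to Cartesian coordinates.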
[61]:
# convert parallax to distance
tgas.add_virtual_columns_distance_from_parallax(tgas.parallax)
# 'overwrite' the real columns 'l' and 'b' with virtual columns
tgas.add_virtual_columns_eq2gal('ra', 'dec', 'l', 'b')
# and combined with the galactic sky coordinates gives galactic cartesian coordinates of the stars
tgas.add_virtual_columns_spherical_to_cartesian(tgas.l, tgas.b, tgas.distance, 'x', 'y', 'z')
[61]:
| # | astrometric_delta_q | astrometric_excess_noise | astrometric_excess_noise_sig | astrometric_n_bad_obs_ac | astrometric_n_bad_obs_al | astrometric_n_good_obs_ac | astrometric_n_good_obs_al | astrometric_n_obs_ac | astrometric_n_obs_al | astrometric_primary_flag | astrometric_priors_used | astrometric_relegation_factor | astrometric_weight_ac | astrometric_weight_al | b | dec | dec_error | dec_parallax_corr | dec_pmdec_corr | dec_pmra_corr | duplicated_source | ecl_lat | ecl_lon | hip | l | matched_observations | parallax | parallax_error | parallax_pmdec_corr | parallax_pmra_corr | phot_g_mean_flux | phot_g_mean_flux_error | phot_g_mean_mag | phot_g_n_obs | phot_variable_flag | pmdec | pmdec_error | pmra | pmra_error | pmra_pmdec_corr | ra | ra_dec_corr | ra_error | ra_parallax_corr | ra_pmdec_corr | ra_pmra_corr | random_index | ref_epoch | scan_direction_mean_k1 | scan_direction_mean_k2 | scan_direction_mean_k3 | scan_direction_mean_k4 | scan_direction_strength_k1 | scan_direction_strength_k2 | scan_direction_strength_k3 | scan_direction_strength_k4 | solution_id | source_id | tycho2_id | distance | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.9190566539764404 | 0.7171010000916003 | 412.6059727233687 | 1 | 0 | 78 | 79 | 79 | 79 | 84 | 3 | 2.9360971450805664 | 1.2669624084082898e-05 | 1.818157434463501 | -16.121042828114014 | 0.23539164875137225 | 0.21880220693566088 | -0.4073381721973419 | 0.06065881997346878 | -0.09945132583379745 | 70 | -16.121052173353853 | 42.64182504417002 | 13989 | 42.641804308626725 | 9 | 6.35295075173405 | 0.3079103606852086 | -0.10195717215538025 | -0.0015767893055453897 | 10312332.172993332 | 10577.365273118843 | 7.991377829505826 | 77 | b'NOT_AVAILABLE' | -7.641989988351149 | 0.08740179334554747 | 43.75231341609215 | 0.07054220642640081 | 0.21467718482017517 | 45.03433035439128 | -0.41497212648391724 | 0.30598928200282727 | 0.17996619641780853 | -0.08575969189405441 | 0.15920649468898773 | 243619 | 2015.0 | -113.76032257080078 | 21.39291763305664 | -41.67839813232422 | 26.201841354370117 | 0.3823484778404236 | 0.5382660627365112 | 0.3923785090446472 | 0.9163063168525696 | 1635378410781933568 | 7627862074752 | b'' | 0.15740717016058217 | 0.11123604040005637 | 0.10243667003803988 | -0.04370685490397632 |
| 1 | nan | 0.2534628812968044 | 47.316290890180255 | 2 | 0 | 55 | 57 | 57 | 57 | 84 | 5 | 2.6523141860961914 | 3.1600175134371966e-05 | 12.861557006835938 | -16.19302376369384 | 0.2000676896877873 | 1.1977893944215496 | 0.8376259803771973 | -0.9756439924240112 | 0.9725773334503174 | 70 | -16.19303311057312 | 42.761180489478576 | -2147483648 | 42.76115974936648 | 8 | 3.90032893506844 | 0.3234880030045522 | -0.8537789583206177 | 0.8397389650344849 | 949564.6488279914 | 1140.173576223928 | 10.580958718900256 | 62 | b'NOT_AVAILABLE' | -55.10917285969142 | 2.522928801165149 | 10.03626300124532 | 4.611413518289133 | -0.9963987469673157 | 45.1650067708984 | -0.9959233403205872 | 2.583882288511597 | -0.8609106540679932 | 0.9734798669815063 | -0.9724165201187134 | 487238 | 2015.0 | -156.432861328125 | 22.76607322692871 | -36.23965835571289 | 22.890602111816406 | 0.7110026478767395 | 0.9659702777862549 | 0.6461148858070374 | 0.8671600818634033 | 1635378410781933568 | 9277129363072 | b'55-28-1' | 0.25638863199686845 | 0.1807701962996959 | 0.16716755815017084 | -0.07150016957395491 |
| 2 | nan | 0.3989006354041912 | 221.18496561724646 | 4 | 1 | 57 | 60 | 61 | 61 | 84 | 5 | 3.9934017658233643 | 2.5633918994572014e-05 | 5.767529487609863 | -16.12335382439265 | 0.24882543945301736 | 0.1803264123376257 | -0.39189115166664124 | -0.19325552880764008 | 0.08942046016454697 | 70 | -16.123363170402296 | 42.69750168007008 | -2147483648 | 42.69748094193635 | 7 | 3.1553132200367373 | 0.2734838183180671 | -0.11855248361825943 | -0.0418587327003479 | 817837.6000768564 | 1827.3836759985832 | 10.743102380434273 | 60 | b'NOT_AVAILABLE' | -1.602867102186794 | 1.0352589283446592 | 2.9322836829569003 | 1.908644426623371 | -0.9142706990242004 | 45.08615483797584 | -0.1774432212114334 | 0.2138361631952843 | 0.30772241950035095 | -0.1848166137933731 | 0.04686680808663368 | 1948952 | 2015.0 | -117.00751495361328 | 19.772153854370117 | -43.108219146728516 | 26.7157039642334 | 0.4825277626514435 | 0.4287584722042084 | 0.5241528153419495 | 0.9030616879463196 | 1635378410781933568 | 13297218905216 | b'55-1191-1' | 0.31692574722846595 | 0.22376103019475546 | 0.2064625216744117 | -0.08801225918215205 |
| 3 | nan | 0.4224923646481251 | 179.98201436339852 | 1 | 0 | 51 | 52 | 52 | 52 | 84 | 5 | 4.215157985687256 | 2.8672602638835087e-05 | 5.3608622550964355 | -16.118206879297034 | 0.24821079122833972 | 0.20095844850181172 | -0.33721715211868286 | -0.22350119054317474 | 0.13181143999099731 | 70 | -16.11821622503516 | 42.67779093546686 | -2147483648 | 42.67777019818556 | 7 | 2.292366835156796 | 0.2809724206784257 | -0.10920235514640808 | -0.049440864473581314 | 602053.4754362862 | 905.8772856344845 | 11.075682394435745 | 61 | b'NOT_AVAILABLE' | -18.414912114825732 | 1.1298513589995536 | 3.661982345981763 | 2.065051873379775 | -0.9261773228645325 | 45.06654155758114 | -0.36570677161216736 | 0.2760390513575931 | 0.2028782218694687 | -0.058928851038217545 | -0.050908856093883514 | 102321 | 2015.0 | -132.42112731933594 | 22.56928253173828 | -38.95445251464844 | 25.878559112548828 | 0.4946548640727997 | 0.6384561061859131 | 0.5090736746788025 | 0.8989177942276001 | 1635378410781933568 | 13469017597184 | b'55-624-1' | 0.43623035574565916 | 0.30810014040531863 | 0.2840853806346911 | -0.12110624783986161 |
| 4 | nan | 0.3175001122010629 | 119.74837853832186 | 2 | 3 | 85 | 84 | 87 | 87 | 84 | 5 | 3.2356362342834473 | 2.22787512029754e-05 | 8.080779075622559 | -16.055471830750374 | 0.33504360351532875 | 0.1701298562030361 | -0.43870800733566284 | -0.27934885025024414 | 0.12179157137870789 | 70 | -16.0554811777948 | 42.77336987816832 | -2147483648 | 42.77334913546197 | 11 | 1.582076960273368 | 0.2615394689640736 | -0.329196035861969 | 0.10031197965145111 | 1388122.242048847 | 2826.428866453177 | 10.168700781271088 | 96 | b'NOT_AVAILABLE' | -2.379387386351838 | 0.7106320061478508 | 0.34080233369502516 | 1.2204755227890713 | -0.8336043357849121 | 45.13603822322069 | -0.049052558839321136 | 0.17069695283376776 | 0.4714251756668091 | -0.1563923954963684 | -0.15207625925540924 | 409284 | 2015.0 | -106.85968017578125 | 4.452099323272705 | -47.8953971862793 | 26.755468368530273 | 0.5206537842750549 | 0.23930974304676056 | 0.653376579284668 | 0.8633849024772644 | 1635378410781933568 | 15736760328576 | b'55-849-1' | 0.6320805024726543 | 0.44587838095402044 | 0.41250283253756015 | -0.17481316927621393 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2,057,045 | 25.898868560791016 | 0.6508009723190962 | 172.3136755413185 | 0 | 0 | 54 | 54 | 54 | 54 | 84 | 3 | 6.386378765106201 | 1.8042501324089244e-05 | 2.2653496265411377 | 16.006806970347426 | -0.42319686025158043 | 0.24974147639642075 | 0.00821441039443016 | 0.2133195698261261 | -0.000805279181804508 | 70 | 16.006807041815204 | 317.0782357688112 | 103561 | -42.92178788756781 | 8 | 5.0743069397419776 | 0.2840892420661878 | -0.0308084636926651 | -0.03397708386182785 | 4114975.455725508 | 3447.5776608146016 | 8.988851940956916 | 69 | b'NOT_AVAILABLE' | -4.440524133201202 | 0.04743297901782237 | 21.970772995655643 | 0.07846893118669047 | 0.3920176327228546 | 314.74170043792924 | 0.08548042178153992 | 0.2773321068969684 | 0.2473779171705246 | -0.0006040430744178593 | 0.11652233451604843 | 1595738 | 2015.0 | -18.078920364379883 | -17.731922149658203 | 38.27400588989258 | 27.63787269592285 | 0.29217642545700073 | 0.11402469873428345 | 0.0404343381524086 | 0.937016487121582 | 1635378410781933568 | 6917488998546378368 | b'' | 0.19707124773395138 | 0.13871698568448773 | -0.12900211309069443 | 0.054342703136315784 |
| 2,057,046 | nan | 0.17407523451856974 | 28.886549102578012 | 0 | 2 | 54 | 52 | 54 | 54 | 84 | 5 | 1.9612410068511963 | 2.415467497485224e-05 | 24.774322509765625 | 16.12926993546893 | -0.32497534368232894 | 0.14823365569199975 | 0.8842677474021912 | -0.9121489524841309 | -0.8994856476783752 | 70 | 16.129270018016896 | 317.0105462544942 | -2147483648 | -42.98947742356782 | 7 | 1.6983480817439922 | 0.7410137777358506 | -0.9793509840965271 | -0.9959075450897217 | 1202425.5197785893 | 871.2480333575235 | 10.324624601435723 | 59 | b'NOT_AVAILABLE' | -10.401225111268962 | 1.4016954983272711 | -1.2835612990841874 | 2.7416807292293637 | 0.980453610420227 | 314.64381789311193 | 0.8981446623802185 | 0.3590974400544809 | 0.9818224906921387 | -0.9802247881889343 | -0.9827051162719727 | 2019553 | 2015.0 | -87.07184600830078 | -31.574886322021484 | -36.37055206298828 | 29.130958557128906 | 0.22651544213294983 | 0.07730517536401749 | 0.2675701975822449 | 0.9523505568504333 | 1635378410781933568 | 6917493705830041600 | b'5179-753-1' | 0.5888074481016426 | 0.4137467499267554 | -0.38568304807850484 | 0.16357391078619246 |
| 2,057,047 | nan | 0.47235246463190794 | 92.12190417660749 | 2 | 0 | 34 | 36 | 36 | 36 | 84 | 5 | 4.68601131439209 | 2.138371200999245e-05 | 3.9279115200042725 | 15.92496896432183 | -0.34317732044320387 | 0.20902981533215972 | -0.2000708132982254 | 0.31042322516441345 | -0.3574342727661133 | 70 | 15.924968943694909 | 317.6408327998631 | -2147483648 | -42.359190842094414 | 6 | 6.036938108863445 | 0.39688014089787665 | -0.7275367975234985 | -0.25934046506881714 | 3268640.5253614695 | 4918.5087736624755 | 9.238852161621992 | 51 | b'NOT_AVAILABLE' | -27.852344752672245 | 1.2778575351686428 | 15.713555906870294 | 0.9411842746983148 | -0.1186852976679802 | 315.2828795933192 | -0.47665935754776 | 0.4722647631556871 | 0.704002320766449 | -0.77033931016922 | 0.12704335153102875 | 788948 | 2015.0 | -21.23501205444336 | 20.132535934448242 | 33.55913162231445 | 26.732301712036133 | 0.41511622071266174 | 0.5105549693107605 | 0.15976844727993011 | 0.9333845376968384 | 1635378410781933568 | 6917504975824469248 | b'5192-877-1' | 0.16564688621402263 | 0.11770477437507047 | -0.10732559074953243 | 0.045449912782963474 |
| 2,057,048 | nan | 0.3086465263182493 | 76.66564461310193 | 1 | 2 | 52 | 51 | 53 | 53 | 84 | 5 | 3.154139280319214 | 1.9043474821955897e-05 | 9.627826690673828 | 16.193728871838935 | -0.22811360043544882 | 0.131650037775767 | 0.3082593083381653 | -0.5279345512390137 | -0.4065483510494232 | 70 | 16.193728933791913 | 317.1363617703344 | -2147483648 | -42.86366191921117 | 7 | 1.484142306295484 | 0.34860128377301614 | -0.7272516489028931 | -0.9375584125518799 | 4009408.3172682906 | 1929.9834553649182 | 9.017069346445364 | 60 | b'NOT_AVAILABLE' | 1.8471079057572073 | 0.7307171627866237 | 11.352888915160555 | 1.219847308406543 | 0.7511345148086548 | 314.7406481637209 | 0.41397571563720703 | 0.19205296641778563 | 0.7539510726928711 | -0.7239754796028137 | -0.7911394238471985 | 868066 | 2015.0 | -89.73970794677734 | -25.196216583251953 | -35.13546371459961 | 29.041872024536133 | 0.21430812776088715 | 0.06784655898809433 | 0.2636755108833313 | 0.9523414969444275 | 1635378410781933568 | 6917517998165066624 | b'5179-1401-1' | 0.6737898352187435 | 0.4742760432178817 | -0.44016428945980135 | 0.18791055094922077 |
| 2,057,049 | nan | 0.4329850465924866 | 60.789771079095715 | 0 | 0 | 26 | 26 | 26 | 26 | 84 | 5 | 4.3140177726745605 | 2.7940122890868224e-05 | 4.742301940917969 | 16.135962442685898 | -0.22130081624351935 | 0.2686748166142929 | -0.46605369448661804 | 0.30018869042396545 | -0.3290684223175049 | 70 | 16.13596246842634 | 317.3575812619557 | -2147483648 | -42.642442417388324 | 5 | 2.680111343641743 | 0.4507741964825321 | -0.689416229724884 | -0.1735922396183014 | 2074338.153903563 | 4136.498086035368 | 9.732571175024953 | 31 | b'NOT_AVAILABLE' | 3.15173423618292 | 1.4388911228835037 | 2.897878776243949 | 1.0354817855168323 | -0.21837876737117767 | 314.960730599014 | -0.4467950165271759 | 0.49182050944792216 | 0.7087226510047913 | -0.8360105156898499 | 0.2156151533126831 | 1736132 | 2015.0 | -63.01319885253906 | 18.303699493408203 | -49.05630111694336 | 28.76698875427246 | 0.3929939866065979 | 0.32352808117866516 | 0.24211134016513824 | 0.9409775733947754 | 1635378410781933568 | 6917521537218608640 | b'5179-1719-1' | 0.3731188267130712 | 0.2636519673685346 | -0.24280110216486334 | 0.10369630532457579 |
Since RA and Dec are given in degrees, while ra_error and dec_error are given in milliarcseconds, we need to put them on the same scale.
[62]:
tgas['ra_error'] = tgas.ra_error / 1000 / 3600
tgas['dec_error'] = tgas.dec_error / 1000 / 3600
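The two divisions in the cell above follow from 1 arcsecond = 1000 milliarcseconds and 1 degree = 3600 arcseconds. A minimal standalone sketch of the same conversion is shown here; the helper name `mas_to_deg` is only illustrative and is not part of Vaex.

```python
# 1000 milliarcseconds per arcsecond, 3600 arcseconds per degree
MAS_PER_DEGREE = 1000 * 3600


def mas_to_deg(value_mas):
    """Convert an angle from milliarcseconds to degrees."""
    return value_mas / MAS_PER_DEGREE


# 3.6 million milliarcseconds make up exactly one degree
print(mas_to_deg(3_600_000))
```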
We now let Vaex work out the covariance matrix for the Cartesian coordinates x, y and z. Then we take 50 samples from the dataset for visualization.
[63]:
tgas.propagate_uncertainties([tgas.x, tgas.y, tgas.z])
tgas_50 = tgas.sample(50, random_state=42)
For this small subset of the dataset we can visualize the uncertainties, both with and without the covariance.
[64]:
tgas_50.scatter(tgas_50.x, tgas_50.y, xerr=tgas_50.x_uncertainty, yerr=tgas_50.y_uncertainty)
plt.xlim(-10, 10)
plt.ylim(-10, 10)
plt.show()
tgas_50.scatter(tgas_50.x, tgas_50.y, xerr=tgas_50.x_uncertainty, yerr=tgas_50.y_uncertainty, cov=tgas_50.y_x_covariance)
plt.xlim(-10, 10)
plt.ylim(-10, 10)
plt.show()
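In the second plot, showing error ellipses (so narrow that they look like lines) instead of error bars reveals that the distance information dominates the uncertainty in this case.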
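Why a dominant distance error produces such narrow ellipses can be sketched with first-order error propagation in two dimensions. The example below is a simplified illustration, not what Vaex runs internally: it assumes independent errors on a radius `r` and an angle `theta`, and propagates them to `x = r cos(theta)`, `y = r sin(theta)` through the Jacobian. The function name and the numbers are made up for the illustration.

```python
import math


def propagate_polar_to_cartesian(r, theta, sigma_r, sigma_theta):
    """Linear propagation of independent (r, theta) errors to (x, y).

    Returns (var_x, var_y, cov_xy), i.e. the entries of J @ diag(var) @ J.T.
    """
    # Jacobian of x = r*cos(theta), y = r*sin(theta)
    dx_dr, dx_dt = math.cos(theta), -r * math.sin(theta)
    dy_dr, dy_dt = math.sin(theta), r * math.cos(theta)
    var_r, var_t = sigma_r ** 2, sigma_theta ** 2
    var_x = dx_dr ** 2 * var_r + dx_dt ** 2 * var_t
    var_y = dy_dr ** 2 * var_r + dy_dt ** 2 * var_t
    cov_xy = dx_dr * dy_dr * var_r + dx_dt * dy_dt * var_t
    return var_x, var_y, cov_xy


# A large distance error combined with a tiny angular error
var_x, var_y, cov_xy = propagate_polar_to_cartesian(
    r=5.0, theta=math.pi / 4, sigma_r=1.0, sigma_theta=1e-6)
correlation = cov_xy / math.sqrt(var_x * var_y)
print(correlation)  # very close to 1: the ellipse collapses towards a line
```

When the correlation between x and y approaches 1, the error ellipse degenerates into a line along the line of sight, which is only visible when the covariance is drawn.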
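Just-In-Time compilation#
Let us start with a function that calculates the angular distance between two points on the surface of a sphere. The input of the function is a pair of two angular coordinates, in radians.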
[65]:
import vaex
import numpy as np
# From http://pythonhosted.org/pythran/MANUAL.html
def arc_distance(theta_1, phi_1, theta_2, phi_2):
    """
    Calculates the pairwise arc distance
    between all points in vector a and b.
    """
    temp = (np.sin((theta_2-theta_1)/2)**2
            + np.cos(theta_1)*np.cos(theta_2) * np.sin((phi_2-phi_1)/2)**2)
    distance_matrix = 2 * np.arctan2(np.sqrt(temp), np.sqrt(1-temp))
    return distance_matrix
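As a quick sanity check of the haversine formula, here is a scalar version that uses only the standard library. It is a sketch for verification, not part of the original notebook; the name `arc_distance_scalar` and the clamping of rounding errors are additions made for this illustration. As in the formula above, `theta` plays the role of latitude and `phi` of longitude.

```python
import math


def arc_distance_scalar(theta_1, phi_1, theta_2, phi_2):
    """Haversine arc distance (in radians) between two points on a unit sphere."""
    temp = (math.sin((theta_2 - theta_1) / 2) ** 2
            + math.cos(theta_1) * math.cos(theta_2)
            * math.sin((phi_2 - phi_1) / 2) ** 2)
    # Clamp to [0, 1] to guard against tiny floating point overshoots
    temp = min(1.0, max(0.0, temp))
    return 2 * math.atan2(math.sqrt(temp), math.sqrt(1 - temp))


# Equator to pole is a quarter of a great circle
quarter = arc_distance_scalar(0.0, 0.0, math.pi / 2, 0.0)
# Two opposite points on the equator are half a great circle apart
half = arc_distance_scalar(0.0, 0.0, 0.0, math.pi)
print(quarter, half)
```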
Let us use the New York Taxi dataset of 2015, which can be downloaded in hdf5 format
[66]:
# nytaxi = vaex.open('s3://vaex/taxi/yellow_taxi_2009_2015_f32.hdf5?anon=true')
nytaxi = vaex.open('/Users/jovan/Work/vaex-work/vaex-taxi/data/yellow_taxi_2009_2015_f32.hdf5')
# let's use just 20% of the data, since we want to make sure it fits
# into memory (so we don't measure just hdd/ssd speed)
nytaxi.set_active_fraction(0.2)
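Although the function above expects Numpy arrays, Vaex can pass in columns or expressions. This delays the execution until it is needed, and adds the resulting expression as a virtual column.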
[67]:
nytaxi['arc_distance'] = arc_distance(nytaxi.pickup_longitude * np.pi/180,
                                      nytaxi.pickup_latitude * np.pi/180,
                                      nytaxi.dropoff_longitude * np.pi/180,
                                      nytaxi.dropoff_latitude * np.pi/180)
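The idea of passing an expression through ordinary arithmetic and evaluating it only later can be sketched with a toy class. This is not how Vaex implements expressions; `LazyColumn` and `column` are hypothetical names used only to illustrate deferred evaluation.

```python
class LazyColumn:
    """Toy lazily evaluated column: records what to compute, not the result."""

    def __init__(self, text, fn):
        self.text = text  # human-readable form of the expression
        self.fn = fn      # callable that computes values from a dict of lists

    def __mul__(self, factor):
        # Build a new lazy expression; nothing is computed at this point
        return LazyColumn(f"({self.text} * {factor})",
                          lambda data: [v * factor for v in self.fn(data)])

    def evaluate(self, data):
        # Only now is the recorded computation actually carried out
        return self.fn(data)


def column(name):
    """Create a lazy reference to a named column."""
    return LazyColumn(name, lambda data: data[name])


expr = column("x") * 2          # no data has been touched yet
print(expr.text)                # (x * 2)
result = expr.evaluate({"x": [1, 2, 3]})
print(result)                   # [2, 4, 6]
```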
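When we calculate the mean angular distance of a taxi trip, we encounter some invalid data that produces warnings. We can safely ignore these for this demonstration.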
[68]:
%%time
nytaxi.mean(nytaxi.arc_distance)
/Users/jovan/PyLibrary/vaex/packages/vaex-core/vaex/functions.py:121: RuntimeWarning: invalid value encountered in sqrt
return function(*args, **kwargs)
/Users/jovan/PyLibrary/vaex/packages/vaex-core/vaex/functions.py:121: RuntimeWarning: invalid value encountered in sin
return function(*args, **kwargs)
/Users/jovan/PyLibrary/vaex/packages/vaex-core/vaex/functions.py:121: RuntimeWarning: invalid value encountered in cos
return function(*args, **kwargs)
CPU times: user 44.5 s, sys: 5.03 s, total: 49.5 s
Wall time: 6.14 s
[68]:
array(1.99993285)
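This computation uses quite a few heavy mathematical operations, and since it (internally) uses Numpy arrays, it also creates quite a few temporary arrays. We can optimize the calculation with Just-In-Time compilation, based on numba, pythran, or, if you happen to have an NVIDIA graphics card, cuda. Choose whichever gives the best performance or is easiest to install.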
[69]:
nytaxi['arc_distance_jit'] = nytaxi.arc_distance.jit_numba()
# nytaxi['arc_distance_jit'] = nytaxi.arc_distance.jit_pythran()
# nytaxi['arc_distance_jit'] = nytaxi.arc_distance.jit_cuda()
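The cost of temporary arrays can be illustrated in plain Python. The sketch below is only an analogy, with hypothetical function names: the first version materializes a full intermediate list for every step, much like chained array operations do, while the second computes each element in a single fused pass, which is roughly the kind of loop a JIT compiler aims to produce.

```python
import math


def with_temporaries(a, b):
    """Each step builds a complete intermediate list."""
    diff = [y - x for x, y in zip(a, b)]
    half = [d / 2 for d in diff]
    sines = [math.sin(h) for h in half]
    return [s ** 2 for s in sines]


def fused(a, b):
    """One pass over the data, no intermediate lists."""
    return [math.sin((y - x) / 2) ** 2 for x, y in zip(a, b)]


a = [0.0, 0.1, 0.2]
b = [1.0, 1.1, 1.5]
print(with_temporaries(a, b))
print(fused(a, b))
```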
[70]:
%%time
nytaxi.mean(nytaxi.arc_distance_jit)
/Users/jovan/PyLibrary/vaex/packages/vaex-core/vaex/expression.py:1038: RuntimeWarning: invalid value encountered in f
return self.f(*args, **kwargs)
CPU times: user 25.7 s, sys: 330 ms, total: 26 s
Wall time: 2.31 s
[70]:
array(1.9999328)
We get a significant speedup (\(\sim 3 x\)) in this case.
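The quoted factor follows from the timings printed by the two `%%time` cells. The short calculation below simply divides the reported numbers; the variable names are illustrative.

```python
# Timings copied from the cell outputs above, in seconds
wall_numpy, wall_jit = 6.14, 2.31
cpu_numpy, cpu_jit = 49.5, 26.0

wall_speedup = wall_numpy / wall_jit
cpu_ratio = cpu_numpy / cpu_jit
print(f"wall time speedup: {wall_speedup:.2f}x")
print(f"CPU time ratio: {cpu_ratio:.2f}x")
```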
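Parallel computations#
As mentioned in the section on selections, Vaex can do computations in parallel. Often this is taken care of automatically, for instance when passing multiple selections to a method, or multiple arguments to one of the statistical functions. However, sometimes it is difficult or impossible to express a computation in one expression, and we need to resort to so-called "delayed" computation, similar to what joblib and dask do.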
[71]:
import vaex
df = vaex.example()
limits = [-10, 10]
delayed_count = df.count(df.E, binby=df.x, limits=limits,
                         shape=4, delay=True)
delayed_count
[71]:
<vaex.promise.Promise at 0x7ffbd64072d0>
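Note that the returned value is now a promise (TODO: a more Pythonic way would be to return a Future). This may be subject to change, and the best way to work with it is to use the delayed decorator and to call DataFrame.execute when the results are needed.
In addition to the delayed computation above, we schedule more computations, so that the count and the mean are computed in parallel and only a single pass over the data is needed. We schedule the execution of two extra functions using the vaex.delayed decorator, and run the whole pipeline using df.execute().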
[72]:
delayed_sum = df.sum(df.E, binby=df.x, limits=limits,
                     shape=4, delay=True)
@vaex.delayed
def calculate_mean(sums, counts):
    print('calculating mean')
    return sums/counts
print('before calling mean')
# since calculate_mean is decorated with vaex.delayed
# this now also returns a 'delayed' object (a promise)
delayed_mean = calculate_mean(delayed_sum, delayed_count)
# if we'd like to perform operations on that, we can again
# use the same decorator
@vaex.delayed
def print_mean(means):
    print('means', means)
print_mean(delayed_mean)
print('before calling execute')
df.execute()
# Using the .get on the promise will also return the result
# However, this will only work after execute, and may be
# subject to change
means = delayed_mean.get()
print('same means', means)
before calling mean
before calling execute
calculating mean
means [ -94323.68051598 -118749.23850834 -119119.46292653 -95021.66183457]
same means [ -94323.68051598 -118749.23850834 -119119.46292653 -95021.66183457]
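The benefit of scheduling several reductions before executing them can be sketched without Vaex. The toy scheduler below is a hypothetical illustration, not Vaex's scheduler: it queues reductions and then updates all of them while walking over the data exactly once.

```python
class OnePassScheduler:
    """Toy scheduler: queue reductions, then run them all in one pass."""

    def __init__(self, data):
        self.data = data
        self.tasks = []
        self.passes = 0  # how many times the data has been traversed

    def schedule(self, initial, step):
        # Return a placeholder that is filled in by execute()
        result = {"value": initial, "done": False}
        self.tasks.append((step, result))
        return result

    def execute(self):
        self.passes += 1
        for item in self.data:
            # Every queued reduction sees each item during the same pass
            for step, result in self.tasks:
                result["value"] = step(result["value"], item)
        for _, result in self.tasks:
            result["done"] = True
        self.tasks.clear()


scheduler = OnePassScheduler([1.0, 2.0, 3.0, 4.0])
total = scheduler.schedule(0.0, lambda acc, value: acc + value)
count = scheduler.schedule(0, lambda acc, value: acc + 1)
scheduler.execute()
mean = total["value"] / count["value"]
print(mean, scheduler.passes)
```

Both the sum and the count become available after a single traversal, which is the property the delayed API above is designed to give for large datasets.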
Extending Vaex#
Vaex can be extended using several mechanisms.
Adding functions#
Use the vaex.register_function decorator API to add new functions.
[73]:
import vaex
import numpy as np
@vaex.register_function()
def add_one(ar):
    return ar+1
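The function can be invoked through the df.func accessor, which returns a new expression. Whenever the expression is evaluated in a Vaex context, each argument that is an expression is replaced by a Numpy array.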
[74]:
df = vaex.from_arrays(x=np.arange(4))
df.func.add_one(df.x)
[74]:
Expression = add_one(x)
Length: 4 dtype: int64 (expression)
-----------------------------------
0 1
1 2
2 3
3 4
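By default (passing on_expression=True), the function is also available as a method on expressions, where the expression itself is automatically set as the first argument (since this is a very common use case).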
[75]:
df.x.add_one()
[75]:
Expression = add_one(x)
Length: 4 dtype: int64 (expression)
-----------------------------------
0 1
1 2
2 3
3 4
If the first argument is not an expression, pass on_expression=False, and use the df.func accessor to build a new expression from the function:
[76]:
@vaex.register_function(on_expression=False)
def addmul(a, b, x, y):
    return a*x + b * y
[77]:
df = vaex.from_arrays(x=np.arange(4))
df['y'] = df.x**2
df.func.addmul(2, 3, df.x, df.y)
[77]:
Expression = addmul(2, 3, x, y)
Length: 4 dtype: int64 (expression)
-----------------------------------
0 0
1 5
2 16
3 33
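The effect of the on_expression flag can be sketched with a toy registry. This is an illustration of the decorator pattern under simplified assumptions, not Vaex's implementation; `REGISTRY`, `Expr` and this local `register_function` are hypothetical stand-ins that operate on plain lists.

```python
REGISTRY = {}


class Expr:
    """Toy expression that just wraps a list of values."""

    def __init__(self, values):
        self.values = list(values)


def register_function(on_expression=True):
    def decorator(fn):
        # Always make the function reachable by name
        REGISTRY[fn.__name__] = fn
        if on_expression:
            # Also expose it as a method; the expression's values
            # are passed as the first argument
            def method(self, *args, **kwargs):
                return fn(self.values, *args, **kwargs)
            setattr(Expr, fn.__name__, method)
        return fn
    return decorator


@register_function()
def add_one(values):
    return [v + 1 for v in values]


@register_function(on_expression=False)
def addmul(a, b, x, y):
    return [a * xi + b * yi for xi, yi in zip(x, y)]


x = Expr(range(4))
print(x.add_one())                                        # method form
print(REGISTRY["addmul"](2, 3, x.values, [0, 1, 4, 9]))   # registry form only
```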
These expressions can be added as virtual columns, as expected.
[78]:
df = vaex.from_arrays(x=np.arange(4))
df['y'] = df.x**2
df['z'] = df.func.addmul(2, 3, df.x, df.y)
df['w'] = df.x.add_one()
df
[78]:
| # | x | y | z | w |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 5 | 2 |
| 2 | 2 | 4 | 16 | 3 |
| 3 | 3 | 9 | 33 | 4 |
Adding DataFrame accessors#
When adding methods that operate on DataFrames, it makes sense to group them together in a single namespace.
[79]:
@vaex.register_dataframe_accessor('scale', override=True)
class ScalingOps(object):
    def __init__(self, df):
        self.df = df
    def mul(self, a):
        df = self.df.copy()
        for col in df.get_column_names(strings=False):
            if df[col].dtype:
                df[col] = df[col] * a
        return df
    def add(self, a):
        df = self.df.copy()
        for col in df.get_column_names(strings=False):
            if df[col].dtype:
                df[col] = df[col] + a
        return df
[80]:
df.scale.add(1)
[80]:
| # | x | y | z | w |
|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 2 |
| 1 | 2 | 2 | 6 | 3 |
| 2 | 3 | 5 | 17 | 4 |
| 3 | 4 | 10 | 34 | 5 |
[81]:
df.scale.mul(2)
[81]:
| # | x | y | z | w |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 2 |
| 1 | 2 | 2 | 10 | 4 |
| 2 | 4 | 8 | 32 | 6 |
| 3 | 6 | 18 | 66 | 8 |
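How an accessor namespace can be attached to a class is sketched below with a property. This is a simplified, hypothetical model (`Frame` and `register_accessor` are not Vaex names) that shows the grouping idea and the fact that the operations return a new object instead of modifying the original.

```python
class Frame:
    """Toy table: a dict mapping column names to lists of numbers."""

    def __init__(self, columns):
        self.columns = dict(columns)


def register_accessor(name):
    def decorator(cls):
        # frame.<name> builds the accessor object around that frame
        setattr(Frame, name, property(lambda frame: cls(frame)))
        return cls
    return decorator


@register_accessor("scale")
class ScalingOps:
    def __init__(self, frame):
        self.frame = frame

    def add(self, a):
        return Frame({k: [v + a for v in vals]
                      for k, vals in self.frame.columns.items()})

    def mul(self, a):
        return Frame({k: [v * a for v in vals]
                      for k, vals in self.frame.columns.items()})


frame = Frame({"x": [0, 1, 2, 3]})
print(frame.scale.add(1).columns)
print(frame.scale.mul(2).columns)
```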
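Convenience methods#
Get column names#
Often we want to work with a subset of the columns in a DataFrame. With the get_column_names method, Vaex makes it easy and convenient to get exactly the columns you need. By default, get_column_names returns all columns: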
[1]:
import vaex
df = vaex.datasets.titanic()
print(df.get_column_names())
['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home_dest']
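The same method has a number of arguments that make it easy to get the desired subset of columns. For instance, one can pass a regular expression to select columns based on their names. In the cell below we select all columns whose names are exactly 5 characters long: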
[2]:
print(df.get_column_names(regex='^[a-zA-Z]{5}$'))
['sibsp', 'parch', 'cabin']
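What the pattern matches can be checked with the standard `re` module on the column names printed earlier. This sketch only reproduces the name filtering and makes no claim about how Vaex applies the regular expression internally.

```python
import re

# Column names as printed by df.get_column_names() above
column_names = ['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch',
                'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home_dest']

# ^ and $ anchor the match, so the name must consist of exactly five letters
pattern = re.compile(r'^[a-zA-Z]{5}$')
five_letter_names = [name for name in column_names if pattern.match(name)]
print(five_letter_names)
```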
We can also select columns based on their type. Below we select all columns that are integers or floats:
[3]:
df.get_column_names(dtype=['int', 'float'])
[3]:
['pclass', 'age', 'sibsp', 'parch', 'fare', 'body']
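Escape hatch: apply#
If a calculation cannot be expressed as a Vaex expression, the apply method can be used as a last resort. This can be useful if the function you want to apply is written in pure Python, or comes from a third-party library, and is difficult or impossible to vectorize.
We think apply should only be used as a last resort, because it needs multiprocessing (which spawns new processes) to get around the Python Global Interpreter Lock (GIL) and make use of multiple cores. This comes at the cost of having to transfer the data between the main process and the child processes.
Here is an example that uses the apply method: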
[1]:
import vaex
def slow_is_prime(x):
    return x > 1 and all((x % i) != 0 for i in range(2, x))
df = vaex.from_arrays(x=vaex.vrange(0, 100_000, dtype='i4'))
# you need to explicitly specify which arguments you need
df['is_prime'] = df.apply(slow_is_prime, arguments=[df.x])
df.head(10)
[1]:
| # | x | is_prime |
|---|---|---|
| 0 | 0 | False |
| 1 | 1 | False |
| 2 | 2 | True |
| 3 | 3 | True |
| 4 | 4 | False |
| 5 | 5 | True |
| 6 | 6 | False |
| 7 | 7 | True |
| 8 | 8 | False |
| 9 | 9 | False |
[2]:
prime_count = df.is_prime.sum()
print(f'There are {prime_count} prime numbers between 0 and {len(df)}')
There are 9592 prime numbers between 0 and 100000
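The count can be cross-checked independently of Vaex. The sketch below swaps in a different technique, a sieve of Eratosthenes, instead of the trial division used by slow_is_prime; the function name `count_primes_below` is illustrative.

```python
def slow_is_prime(x):
    """Trial division, same definition as in the example above."""
    return x > 1 and all((x % i) != 0 for i in range(2, x))


def count_primes_below(n):
    """Count the primes p with p < n using a sieve of Eratosthenes."""
    if n < 3:
        return 0
    is_prime = bytearray([1]) * n
    is_prime[0] = is_prime[1] = 0
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            # Cross out every multiple of i, starting at i*i
            is_prime[i * i::i] = bytearray(len(range(i * i, n, i)))
    return sum(is_prime)


# Both approaches agree on a small range
small_slow = sum(slow_is_prime(x) for x in range(1000))
small_sieve = count_primes_below(1000)
print(small_slow, small_sieve)

# The sieve handles the full range of the example quickly
print(count_primes_below(100_000))
```

The sieve is also an example of the kind of restructuring that can make apply unnecessary when the whole range of inputs is known in advance.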
[3]:
# both of these are equivalent
df['is_prime'] = df.apply(slow_is_prime, arguments=[df.x])
# but this form only works for a single argument
df['is_prime'] = df.x.apply(slow_is_prime)
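When not to use apply#
You should not use apply when your function can be vectorized. When you use Vaex's expression system, we know what you are doing: we see the expression and can manipulate it to achieve optimal performance. An apply function is like a black box that we cannot do anything with, such as JIT compiling it.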
[4]:
df = vaex.from_arrays(x=vaex.vrange(0, 10_000_000, dtype='f4'))
[5]:
# ideal case
df['y'] = df.x**2
[6]:
%%timeit
df.y.sum()
29.6 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
[7]:
# will transfer the data to child processes, and execute the ** operation in Python for each element
df['y_slow'] = df.x.apply(lambda x: x**2)
[8]:
%%timeit
df.y_slow.sum()
353 ms ± 40 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[9]:
# bad idea: it will transfer the data to the child process, where it will be executed in vectorized form
df['y_slow_vectorized'] = df.x.apply(lambda x: x**2, vectorize=True)
[10]:
%%timeit
df.y_slow_vectorized.sum()
82.8 ms ± 525 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
[11]:
# bad idea: same performance as just df['y'], but we lose the information about what was done
df['y_fast'] = df.x.apply(lambda x: x**2, vectorize=True, multiprocessing=False)
[12]:
%%timeit
df.y_fast.sum()
28.8 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)