箭头#

Vaex 支持 Arrow。我们将通过快速查看一个不适合内存的大型数据集来展示 vaex+arrow 的功能。2015 年的纽约出租车数据集包含大约 1.5 亿行关于纽约出租车行程的信息,大小约为 23GB。你可以在这里下载它:

如果你想将其转换为箭头格式,请使用以下代码:

ds_hdf5 = vaex.open('/Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.hdf5')
# this may take a while to export
ds_hdf5.export('./nyc_taxi2015.arrow')
[1]:
!ls -alh /Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.arrow
-rw-r--r--  1 maartenbreddels  staff    23G Oct 31 18:56 /Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.arrow
[3]:
import vaex

立即打开#

打开文件是瞬间完成的,因为没有数据被复制到内存中。数据只是被内存映射,这是一种只在需要时读取数据的技术。

[4]:
%time
df = vaex.open('/Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.arrow')
CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 6.91 µs
[5]:
df
[5]:
#供应商ID下车星期几下车小时下车纬度下车经度额外费用车费金额改进附加费MTA税乘客数量支付类型上车星期几上车小时上车纬度上车经度小费金额通行费金额总金额下车时间上车时间行程距离
023.019.040.75061798095703-73.974784851074221.012.00.30.5113.019.040.7501106262207-73.9938964843753.250.017.05numpy.datetime64('2015-01-15T19:23:42.000000000')numpy.datetime64('2015-01-15T19:05:39.000000000')1.59
115.020.040.75910949707031-73.994415283203120.514.50.30.5115.020.040.7242431640625-74.001647949218752.00.017.8numpy.datetime64('2015-01-10T20:53:28.000000000')numpy.datetime64('2015-01-10T20:33:38.000000000')3.3
215.020.040.82441329956055-73.951820373535160.59.50.30.5125.020.040.80278778076172-73.963340759277340.00.010.8numpy.datetime64('2015-01-10T20:43:41.000000000')numpy.datetime64('2015-01-10T20:33:38.000000000')1.8
315.020.040.71998596191406-74.004325866699230.53.50.30.5125.020.040.71381759643555-74.009086608886720.00.04.8numpy.datetime64('2015-01-10T20:35:31.000000000')numpy.datetime64('2015-01-10T20:33:39.000000000')0.5
415.020.040.742652893066406-74.004180908203120.515.00.30.5125.020.040.762428283691406-73.971176147460940.00.016.3numpy.datetime64('2015-01-10T20:52:58.000000000')numpy.datetime64('2015-01-10T20:33:39.000000000')3.0
..................................................................
146,112,98424.00.040.722469329833984-73.986213684082030.57.50.30.5513.023.040.72087097167969-73.993812561035161.760.010.56numpy.datetime64('2016-01-01T00:08:18.000000000')numpy.datetime64('2015-12-31T23:59:56.000000000')1.2
146,112,98514.00.040.75238800048828-73.939514160156250.57.50.30.5223.023.040.76028060913085-73.965270996093750.00.08.8numpy.datetime64('2016-01-01T00:05:19.000000000')numpy.datetime64('2015-12-31T23:59:58.000000000')2.0
146,112,98614.00.040.69329833984375-73.98867034912110.513.50.30.5223.023.040.73907852172852-73.987297058105470.00.014.8numpy.datetime64('2016-01-01T00:12:55.000000000')numpy.datetime64('2015-12-31T23:59:59.000000000')3.8
146,112,98724.00.040.705322265625-74.017120361328120.58.50.30.5123.023.040.72569274902344-73.997558593750.00.09.8numpy.datetime64('2016-01-01T00:10:26.000000000')numpy.datetime64('2015-12-31T23:59:59.000000000')1.96
146,112,98824.00.040.76057052612305-73.990982055664060.513.50.30.5113.023.040.76725769042969-73.984397888183582.960.017.76numpy.datetime64('2016-01-01T00:21:30.000000000')numpy.datetime64('2015-12-31T23:59:59.000000000')1.06

快速可视化1.46亿行数据#

可以看出,这个数据集包含1.46亿行。使用plot,我们可以快速生成数据内容的概览。上车地点很好地勾勒出了曼哈顿的轮廓。

[6]:
df.viz.heatmap(df.pickup_longitude, df.pickup_latitude, f='log')
../_images/guides_arrow_8_0.png
[7]:
df.total_amount.minmax()
[7]:
array([-4.9630000e+02,  3.9506116e+06])

数据清洗:异常值#

从total_amount列(人们支付了多少)可以看出,这个数据集中包含异常值。通过快速的1d图,我们可以看到合理的数据过滤方法。

[8]:
df.plot1d(df.total_amount, shape=100, limits=[0, 100])
[8]:
[<matplotlib.lines.Line2D at 0x121d26320>]
../_images/guides_arrow_11_1.png
[9]:
# filter the dataset
dff = df[(df.total_amount >= 0) & (df.total_amount < 100)]

浅拷贝#

这个过滤后的数据集没有复制任何数据(否则会消耗我们大约23GB的内存)。相反,创建了数据的浅拷贝,并使用布尔掩码来跟踪应该使用哪些行。

[10]:
dff['ratio'] = dff.tip_amount/dff.total_amount

虚拟列#

新列 ratio 目前还没有进行任何计算,它只存储了表达式并且不占用任何内存。然而,这个新的(虚拟)列可以像普通列一样用于计算。

[11]:
dff.ratio.mean()
<string>:1: RuntimeWarning: invalid value encountered in true_divide
[11]:
0.09601926650107262

结果#

我们的最终结果,小费的百分比,可以很容易地为这个大数据集计算出来,它不需要任何过多的内存。

互操作性#

由于数据以Arrow数组形式存在,我们可以将它们传递给其他库,如pandas,甚至传递给其他进程。

[12]:
arrow_table = df.to_arrow_table()
arrow_table
[12]:
pyarrow.Table
VendorID: int64
dropoff_dayofweek: double
dropoff_hour: double
dropoff_latitude: double
dropoff_longitude: double
extra: double
fare_amount: double
improvement_surcharge: double
mta_tax: double
passenger_count: int64
payment_type: int64
pickup_dayofweek: double
pickup_hour: double
pickup_latitude: double
pickup_longitude: double
tip_amount: double
tolls_amount: double
total_amount: double
tpep_dropoff_datetime: timestamp[ns]
tpep_pickup_datetime: timestamp[ns]
trip_distance: double
[13]:
# Although you can 'convert' (pass the data) in to pandas,
# some memory will be wasted (at least an index will be created by pandas)
# here we just pass a subset of the data
df_pandas = df[:10000].to_pandas_df()
df_pandas
[13]:
供应商ID 下车星期几 下车小时 下车纬度 下车经度 额外费用 车费金额 改进附加费 MTA税 乘客数量 ... 上车星期几 上车小时 上车纬度 上车经度 小费金额 通行费金额 总金额 下车时间 上车时间 行程距离
0 2 3.0 19.0 40.750618 -73.974785 1.0 12.0 0.3 0.5 1 ... 3.0 19.0 40.750111 -73.993896 3.25 0.00 17.05 2015-01-15 19:23:42 2015-01-15 19:05:39 1.59
1 1 5.0 20.0 40.759109 -73.994415 0.5 14.5 0.3 0.5 1 ... 5.0 20.0 40.724243 -74.001648 2.00 0.00 17.80 2015-01-10 20:53:28 2015-01-10 20:33:38 3.30
2 1 5.0 20.0 40.824413 -73.951820 0.5 9.5 0.3 0.5 1 ... 5.0 20.0 40.802788 -73.963341 0.00 0.00 10.80 2015-01-10 20:43:41 2015-01-10 20:33:38 1.80
3 1 5.0 20.0 40.719986 -74.004326 0.5 3.5 0.3 0.5 1 ... 5.0 20.0 40.713818 -74.009087 0.00 0.00 4.80 2015-01-10 20:35:31 2015-01-10 20:33:39 0.50
4 1 5.0 20.0 40.742653 -74.004181 0.5 15.0 0.3 0.5 1 ... 5.0 20.0 40.762428 -73.971176 0.00 0.00 16.30 2015-01-10 20:52:58 2015-01-10 20:33:39 3.00
5 1 5.0 20.0 40.758194 -73.986977 0.5 27.0 0.3 0.5 1 ... 5.0 20.0 40.774048 -73.874374 6.70 5.33 40.33 2015-01-10 20:53:52 2015-01-10 20:33:39 9.00
6 1 5.0 20.0 40.749634 -73.992470 0.5 14.0 0.3 0.5 1 ... 5.0 20.0 40.726009 -73.983276 0.00 0.00 15.30 2015-01-10 20:58:31 2015-01-10 20:33:39 2.20
7 1 5.0 20.0 40.726326 -73.995010 0.5 7.0 0.3 0.5 3 ... 5.0 20.0 40.734142 -74.002663 1.66 0.00 9.96 2015-01-10 20:42:20 2015-01-10 20:33:39 0.80
8 1 5.0 21.0 40.759357 -73.987595 0.0 52.0 0.3 0.5 3 ... 5.0 20.0 40.644356 -73.783043 0.00 5.33 58.13 2015-01-10 21:11:35 2015-01-10 20:33:39 18.20
9 1 5.0 20.0 40.759365 -73.985916 0.5 6.5 0.3 0.5 2 ... 5.0 20.0 40.767948 -73.985588 1.55 0.00 9.35 2015-01-10 20:40:44 2015-01-10 20:33:40 0.90
10 1 5.0 20.0 40.728584 -74.004395 0.5 7.0 0.3 0.5 1 ... 5.0 20.0 40.723103 -73.988617 1.66 0.00 9.96 2015-01-10 20:41:39 2015-01-10 20:33:40 0.90
11 1 5.0 20.0 40.757217 -73.967407 0.5 7.5 0.3 0.5 1 ... 5.0 20.0 40.751419 -73.993782 1.00 0.00 9.80 2015-01-10 20:43:26 2015-01-10 20:33:41 1.10
12 1 5.0 20.0 40.707726 -74.009773 0.5 3.0 0.3 0.5 1 ... 5.0 20.0 40.704376 -74.008362 0.00 0.00 4.30 2015-01-10 20:35:23 2015-01-10 20:33:41 0.30
13 1 5.0 21.0 40.735210 -73.997345 0.5 19.0 0.3 0.5 1 ... 5.0 20.0 40.760448 -73.973946 3.00 0.00 23.30 2015-01-10 21:03:04 2015-01-10 20:33:41 3.10
14 1 5.0 20.0 40.739895 -73.995216 0.5 6.0 0.3 0.5 1 ... 5.0 20.0 40.731777 -74.006721 0.00 0.00 7.30 2015-01-10 20:39:23 2015-01-10 20:33:41 1.10
15 2 3.0 19.0 40.757889 -73.983978 1.0 16.5 0.3 0.5 1 ... 3.0 19.0 40.739811 -73.976425 4.38 0.00 22.68 2015-01-15 19:32:00 2015-01-15 19:05:39 2.38
16 2 3.0 19.0 40.786858 -73.955124 1.0 12.5 0.3 0.5 5 ... 3.0 19.0 40.754246 -73.968704 0.00 0.00 14.30 2015-01-15 19:21:00 2015-01-15 19:05:40 2.83
17 2 3.0 19.0 40.785782 -73.952713 1.0 26.0 0.3 0.5 5 ... 3.0 19.0 40.769581 -73.863060 8.08 5.33 41.21 2015-01-15 19:28:18 2015-01-15 19:05:40 8.33
18 2 3.0 19.0 40.786083 -73.980850 1.0 11.5 0.3 0.5 1 ... 3.0 19.0 40.779423 -73.945541 0.00 0.00 13.30 2015-01-15 19:20:36 2015-01-15 19:05:41 2.37
19 2 3.0 19.0 40.718590 -73.952377 1.0 21.5 0.3 0.5 2 ... 3.0 19.0 40.774010 -73.874458 4.50 0.00 27.80 2015-01-15 19:20:22 2015-01-15 19:05:41 7.13
20 2 3.0 19.0 40.714596 -73.998924 1.0 17.5 0.3 0.5 1 ... 3.0 19.0 40.751896 -73.976601 0.00 0.00 19.30 2015-01-15 19:31:00 2015-01-15 19:05:41 3.60
21 2 3.0 19.0 40.734650 -73.999939 1.0 5.5 0.3 0.5 1 ... 3.0 19.0 40.745079 -73.994957 1.62 0.00 8.92 2015-01-15 19:10:22 2015-01-15 19:05:41 0.89
22 2 3.0 19.0 40.735512 -74.003563 1.0 5.5 0.3 0.5 1 ... 3.0 19.0 40.747063 -74.000938 1.30 0.00 8.60 2015-01-15 19:10:55 2015-01-15 19:05:41 0.96
23 2 3.0 19.0 40.704220 -74.007919 1.0 6.5 0.3 0.5 2 ... 3.0 19.0 40.717892 -74.002777 1.50 0.00 9.80 2015-01-15 19:12:36 2015-01-15 19:05:41 1.25
24 2 3.0 19.0 40.761856 -73.978172 1.0 11.5 0.3 0.5 5 ... 3.0 19.0 40.736362 -73.997459 2.50 0.00 15.80 2015-01-15 19:22:11 2015-01-15 19:05:41 2.11
25 2 3.0 19.0 40.811089 -73.953339 1.0 7.5 0.3 0.5 5 ... 3.0 19.0 40.823994 -73.952278 1.70 0.00 11.00 2015-01-15 19:14:05 2015-01-15 19:05:41 1.15
26 2 3.0 19.0 40.734890 -73.988609 1.0 9.0 0.3 0.5 1 ... 3.0 19.0 40.750080 -73.991127 0.00 0.00 10.80 2015-01-15 19:16:18 2015-01-15 19:05:42 1.53
27 2 3.0 19.0 40.743530 -73.985603 0.0 52.0 0.3 0.5 1 ... 3.0 19.0 40.644127 -73.786575 6.00 5.33 64.13 2015-01-15 19:49:07 2015-01-15 19:05:42 18.06
28 2 3.0 19.0 40.757721 -73.994514 1.0 10.0 0.3 0.5 1 ... 3.0 19.0 40.741447 -73.993668 2.36 0.00 14.16 2015-01-15 19:18:33 2015-01-15 19:05:42 1.76
29 2 3.0 19.0 40.704689 -74.009079 1.0 17.5 0.3 0.5 6 ... 3.0 19.0 40.744083 -73.985291 3.70 0.00 23.00 2015-01-15 19:21:40 2015-01-15 19:05:42 5.19
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9970 1 4.0 11.0 40.719917 -73.955521 0.0 20.0 0.3 0.5 1 ... 4.0 10.0 40.725979 -74.009071 4.00 0.00 24.80 2015-01-30 11:20:08 2015-01-30 10:51:40 3.70
9971 1 4.0 10.0 40.720398 -73.984940 1.0 6.5 0.3 0.5 1 ... 4.0 10.0 40.732452 -73.985001 1.65 0.00 9.95 2015-01-30 10:58:58 2015-01-30 10:51:40 1.10
9972 1 4.0 11.0 40.755405 -74.002457 0.0 8.5 0.3 0.5 2 ... 4.0 10.0 40.751358 -73.990479 1.00 0.00 10.30 2015-01-30 11:03:41 2015-01-30 10:51:41 0.70
9973 2 1.0 19.0 40.763626 -73.969666 1.0 24.5 0.3 0.5 1 ... 1.0 18.0 40.708790 -74.017281 5.10 0.00 31.40 2015-01-13 19:22:18 2015-01-13 18:55:41 7.08
9974 2 1.0 19.0 40.772366 -73.960800 1.0 5.5 0.3 0.5 5 ... 1.0 18.0 40.780003 -73.954681 1.00 0.00 8.30 2015-01-13 19:02:03 2015-01-13 18:55:41 0.64
9975 2 1.0 19.0 40.733429 -73.984154 1.0 9.0 0.3 0.5 1 ... 1.0 18.0 40.749680 -73.991531 0.00 0.00 10.80 2015-01-13 19:06:56 2015-01-13 18:55:41 1.67
9976 2 1.0 19.0 40.774780 -73.957779 1.0 20.0 0.3 0.5 3 ... 1.0 18.0 40.751801 -74.002327 2.00 0.00 23.80 2015-01-13 19:18:39 2015-01-13 18:55:42 5.28
9977 2 1.0 19.0 40.751698 -73.989746 1.0 8.5 0.3 0.5 2 ... 1.0 18.0 40.768433 -73.986137 0.00 0.00 10.30 2015-01-13 19:06:38 2015-01-13 18:55:42 1.38
9978 2 1.0 19.0 40.752941 -73.977470 1.0 7.5 0.3 0.5 1 ... 1.0 18.0 40.745071 -73.987068 1.00 0.00 10.30 2015-01-13 19:05:34 2015-01-13 18:55:42 0.88
9979 2 1.0 19.0 40.735130 -73.976120 1.0 8.5 0.3 0.5 1 ... 1.0 18.0 40.751259 -73.977814 0.00 0.00 10.30 2015-01-13 19:05:41 2015-01-13 18:55:42 1.58
9980 2 1.0 19.0 40.745541 -73.984383 1.0 8.5 0.3 0.5 1 ... 1.0 18.0 40.731110 -74.001350 0.00 0.00 10.30 2015-01-13 19:05:32 2015-01-13 18:55:42 1.58
9981 2 1.0 19.0 40.793671 -73.974327 1.0 5.0 0.3 0.5 2 ... 1.0 18.0 40.791222 -73.965118 0.00 0.00 6.80 2015-01-13 19:00:05 2015-01-13 18:55:42 0.63
9982 2 1.0 19.0 40.754639 -73.986343 1.0 11.0 0.3 0.5 1 ... 1.0 18.0 40.764175 -73.968994 1.00 0.00 13.80 2015-01-13 19:11:57 2015-01-13 18:55:43 1.63
9983 2 1.0 18.0 40.723721 -73.989494 1.0 4.5 0.3 0.5 1 ... 1.0 18.0 40.714985 -73.992409 2.00 0.00 8.30 2015-01-13 18:59:19 2015-01-13 18:55:43 0.70
9984 2 1.0 19.0 40.774590 -73.963249 1.0 5.5 0.3 0.5 5 ... 1.0 18.0 40.764881 -73.968529 1.30 0.00 8.60 2015-01-13 19:01:19 2015-01-13 18:55:44 0.94
9985 2 1.0 19.0 40.774872 -73.982613 1.0 7.0 0.3 0.5 1 ... 1.0 18.0 40.762344 -73.985695 1.60 0.00 10.40 2015-01-13 19:03:54 2015-01-13 18:55:44 1.04
9986 2 1.0 19.0 40.787998 -73.953888 1.0 5.0 0.3 0.5 2 ... 1.0 18.0 40.779526 -73.957619 1.20 0.00 8.00 2015-01-13 19:00:06 2015-01-13 18:55:44 0.74
9987 2 1.0 19.0 40.790218 -73.975128 1.0 11.5 0.3 0.5 1 ... 1.0 18.0 40.762226 -73.985916 2.50 0.00 15.80 2015-01-13 19:10:46 2015-01-13 18:55:44 2.19
9988 2 1.0 19.0 40.739487 -73.989059 1.0 9.5 0.3 0.5 1 ... 1.0 18.0 40.725056 -73.984329 2.10 0.00 13.40 2015-01-13 19:08:40 2015-01-13 18:55:44 1.48
9989 2 1.0 19.0 40.780548 -73.959030 1.0 8.5 0.3 0.5 1 ... 1.0 18.0 40.778542 -73.981949 1.00 0.00 11.30 2015-01-13 19:04:44 2015-01-13 18:55:45 1.83
9990 2 1.0 19.0 40.761524 -73.960602 1.0 15.0 0.3 0.5 1 ... 1.0 18.0 40.746319 -74.001114 0.00 0.00 16.80 2015-01-13 19:14:59 2015-01-13 18:55:45 3.27
9991 2 1.0 19.0 40.720646 -73.989716 1.0 8.0 0.3 0.5 1 ... 1.0 18.0 40.738167 -73.987434 1.00 0.00 10.80 2015-01-13 19:04:58 2015-01-13 18:55:45 1.56
9992 2 1.0 19.0 40.795898 -73.972610 1.0 20.5 0.3 0.5 1 ... 1.0 18.0 40.740582 -73.989738 4.30 0.00 26.60 2015-01-13 19:18:18 2015-01-13 18:55:45 5.40
9993 2 1.0 18.0 40.769939 -73.981316 1.0 4.5 0.3 0.5 1 ... 1.0 18.0 40.772015 -73.979416 1.10 0.00 7.40 2015-01-13 18:59:40 2015-01-13 18:55:45 0.34
9994 2 4.0 18.0 40.773521 -73.955353 1.0 31.0 0.3 0.5 1 ... 4.0 18.0 40.713215 -74.013542 5.00 0.00 37.80 2015-01-23 18:59:52 2015-01-23 18:22:55 9.05
9995 2 4.0 18.0 40.774670 -73.947845 1.0 11.5 0.3 0.5 1 ... 4.0 18.0 40.773186 -73.978043 0.00 0.00 13.30 2015-01-23 18:37:44 2015-01-23 18:22:55 2.32
9996 2 4.0 18.0 40.758148 -73.985626 1.0 8.5 0.3 0.5 2 ... 4.0 18.0 40.752003 -73.973198 0.00 0.00 10.30 2015-01-23 18:34:48 2015-01-23 18:22:56 0.92
9997 2 4.0 18.0 40.768131 -73.964516 1.0 10.5 0.3 0.5 1 ... 4.0 18.0 40.740456 -73.986252 2.46 0.00 14.76 2015-01-23 18:33:58 2015-01-23 18:22:56 2.36
9998 2 4.0 18.0 40.759171 -73.975189 1.0 6.5 0.3 0.5 3 ... 4.0 18.0 40.770500 -73.981323 2.08 0.00 10.38 2015-01-23 18:29:22 2015-01-23 18:22:56 1.05
9999 2 4.0 18.0 40.752113 -73.975189 1.0 5.0 0.3 0.5 1 ... 4.0 18.0 40.761505 -73.968452 0.00 0.00 6.80 2015-01-23 18:27:58 2015-01-23 18:22:57 0.75

10000 行 × 21 列

教程#

如果你想了解更多关于vaex的信息,请查看教程以了解可能的内容