Datasets to download

Here we list a number of datasets that may be interesting to explore with vaex.

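New York taxi dataset

A well-known dataset containing trip information from New York City's iconic Yellow Taxi company. The raw data is curated by the Taxi & Limousine Commission (TLC).

See for instance Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance for some ideas.

The data can also be streamed directly from S3. Only the data that is needed is streamed, and it is cached locally: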

import vaex
df = vaex.open('s3://vaex/taxi/nyc_taxi_2015_mini.hdf5?anon=true')
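The S3 paths on this page share one shape: a bucket, a key, and an `anon=true` option. The following is a minimal sketch, assuming a hypothetical helper `s3_url` (not part of vaex), that builds such paths. The `vaex` import is deferred so that the helper itself runs without vaex or network access.

```python
def s3_url(bucket: str, key: str, anon: bool = True) -> str:
    """Build an s3:// path in the form used on this page (hypothetical helper)."""
    url = f"s3://{bucket}/{key}"
    # The examples on this page append ?anon=true to request anonymous access.
    return f"{url}?anon=true" if anon else url


def open_remote(bucket: str, key: str):
    """Open a remote file with vaex.open; needs vaex installed and network access."""
    import vaex  # imported lazily so s3_url stays usable on its own
    return vaex.open(s3_url(bucket, key))


taxi_url = s3_url("vaex", "taxi/nyc_taxi_2015_mini.hdf5")
print(taxi_url)
```

Calling `open_remote("vaex", "taxi/nyc_taxi_2015_mini.hdf5")` would then be equivalent to the `vaex.open` call shown above.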

[ ]:

import vaex
import warnings; warnings.filterwarnings("ignore")

df = vaex.open('/data/yellow_taxi_2009_2015_f32.hdf5')

print(f'number of rows: {df.shape[0]:,}')
print(f'number of columns: {df.shape[1]}')

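# Bounding box around New York City, in degrees
long_min = -74.05
long_max = -73.75
lat_min = 40.58
lat_max = 40.90

# Density of pickup locations on a log1p colour scale, restricted to the bounding box
df.plot(df.pickup_longitude, df.pickup_latitude, f="log1p", limits=[[long_min, long_max], [lat_min, lat_max]], show=True);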

number of rows: 1,173,057,927
number of columns: 18
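To make the `limits` and `f="log1p"` arguments concrete, here is a rough pure-Python sketch of the idea behind such a density plot: count points on a regular grid inside the limits, then apply `log1p` to each cell. It is an illustration only, not vaex's implementation; the function name, grid size, and sample coordinates are made up.

```python
import math


def log1p_heatmap(xs, ys, limits, shape=(4, 4)):
    """Count points on a regular grid, then apply log1p to every cell."""
    (x_min, x_max), (y_min, y_max) = limits
    nx, ny = shape
    counts = [[0] * ny for _ in range(nx)]
    for x, y in zip(xs, ys):
        # Points outside the limits are ignored rather than clipped.
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue
        i = int((x - x_min) / (x_max - x_min) * nx)
        j = int((y - y_min) / (y_max - y_min) * ny)
        counts[i][j] += 1
    return [[math.log1p(c) for c in row] for row in counts]


# Made-up pickup coordinates; the last one falls outside the limits.
longitudes = [-74.00, -74.00, -73.99, -73.80, -75.00]
latitudes = [40.70, 40.70, 40.71, 40.85, 40.70]
grid = log1p_heatmap(longitudes, latitudes, limits=[[-74.05, -73.75], [40.58, 40.90]])
```

Because `log1p(0)` is 0, empty cells stay at zero while busy cells are compressed, which keeps sparse areas visible next to very dense ones.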

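Gaia - European Space Agency

Gaia is an ambitious mission to chart a three-dimensional map of our Galaxy, the Milky Way, in the process revealing its composition, formation and evolution.

See the Gaia Science Homepage for details. You may also want to try the Gaia Archive for ADQL (SQL-like) queries.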
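As an illustration of what an ADQL query can look like, the sketch below builds a cone-search query string in Python. The helper name and the example coordinates are hypothetical, and the table name `gaiadr2.gaia_source` is assumed to be the DR2 source table in the archive; the string would still have to be submitted to the archive by other means.

```python
def gaia_cone_query(ra_deg: float, dec_deg: float, radius_deg: float, top: int = 10) -> str:
    """Return an ADQL cone-search query string (hypothetical helper)."""
    return (
        f"SELECT TOP {top} source_id, ra, dec, parallax, phot_g_mean_mag\n"
        "FROM gaiadr2.gaia_source\n"
        # CONTAINS returns 1 when the point lies inside the circle.
        "WHERE 1 = CONTAINS(POINT('ICRS', ra, dec), "
        f"CIRCLE('ICRS', {ra_deg}, {dec_deg}, {radius_deg}))\n"
        "ORDER BY phot_g_mean_mag"
    )


query = gaia_cone_query(56.75, 24.12, 0.5)
print(query)
```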

[2]:

df = vaex.open('/data/gaia-dr2-sort-by-source_id.hdf5')

print(f'number of rows: {df.shape[0]:,}')
print(f'number of columns: {df.shape[1]}')

df.plot("ra", "dec", f="log", limits=[[360, 0], [-90, 90]], show=True);

number of rows: 1,692,919,135
number of columns: 94

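U.S. airline dataset

This dataset contains information on flights within the United States between 1988 and 2018. The original data can be downloaded from the United States Department of Transportation.

Year: 1988-2018 - 180 million rows - 17GB

It can also be streamed from S3: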

import vaex
df = vaex.open('s3://vaex/airline/us_airline_data_1988_2018.hdf5?anon=true')

[3]:

df = vaex.open('/data/airline/us_airline_data_1988_2018.hd5')

print(f'number of rows: {df.shape[0]:,}')
print(f'number of columns: {df.shape[1]}')

df.head(5)

number of rows: 183,821,926
number of columns: 29

[3]:

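#	Year	Month	DayOfMonth	DayOfWeek	UniqueCarrier	TailNum	FlightNum	Origin	Dest	CRSDepTime	DepTime	DepDelay	TaxiOut	TaxiIn	CRSArrTime	ArrTime	ArrDelay	Cancelled	CancellationCode	CRSElapsedTime	ActualElapsedTime	AirTime	Distance	CarrierDelay	WeatherDelay	NASDelay	SecurityDelay	LateAircraftDelay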
0	1988	1	8	5	PI	None	930	BGM	ITH	1525	1532	7	--	--	1545	1555	10	0	None	20	23	--	32	--	--	--	--	--
1	1988	1	9	6	PI	None	930	BGM	ITH	1525	1522	-3	--	--	1545	1535	-10	0	None	20	13	--	32	--	--	--	--	--
2	1988	1	10	7	PI	None	930	BGM	ITH	1525	1522	-3	--	--	1545	1534	-11	0	None	20	12	--	32	--	--	--	--	--
3	1988	1	11	1	PI	None	930	BGM	ITH	1525	--	--	--	--	1545	--	--	1	None	20	--	--	32	--	--	--	--	--
4	1988	1	12	2	PI	None	930	BGM	ITH	1525	1524	-1	--	--	1545	1540	-5	0	None	20	16	--	32	--	--	--	--	--
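In the preview above, `--` marks a missing (masked) value; row 3, for example, is a cancelled flight and has no actual departure or arrival time. As a small sketch of how such cells might be handled, the snippet below averages the departure-delay and arrival-delay cells of the five preview rows (the 12th and 17th columns after the row index) while skipping the missing ones. The helper names are hypothetical.

```python
def parse_cell(cell: str):
    """Turn a preview cell into an int, or None for the masked marker '--'."""
    return None if cell == "--" else int(cell)


def mean_ignoring_missing(cells):
    """Average the present values; return None if every value is missing."""
    values = [v for v in map(parse_cell, cells) if v is not None]
    return sum(values) / len(values) if values else None


# Departure and arrival delay cells copied from the five preview rows.
dep_delay = ["7", "-3", "-3", "--", "-1"]
arr_delay = ["10", "-10", "-11", "--", "-5"]

print(mean_ignoring_missing(dep_delay))  # 0.0
print(mean_ignoring_missing(arr_delay))  # -4.0
```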

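Sloan Digital Sky Survey (SDSS)

The data is public and can be queried from the SDSS archive. The original query at the SDSS archive was (although split into smaller parts):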

SELECT ra, dec, g, r FROM PhotoObjAll WHERE type = 6 AND clean = 1 AND r >= 10.0 AND r < 23.5;
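The page does not say how the query was split. One plausible approach, shown here purely as an assumption, is to slice the sky into right-ascension ranges and run one query per slice. The function name and the number of slices are hypothetical.

```python
BASE_QUERY = (
    "SELECT ra, dec, g, r FROM PhotoObjAll "
    "WHERE type = 6 AND clean = 1 AND r >= 10.0 AND r < 23.5"
)


def split_by_ra(n_chunks: int):
    """Yield one query per right-ascension slice covering 0 to 360 degrees."""
    width = 360.0 / n_chunks
    for i in range(n_chunks):
        lo, hi = i * width, (i + 1) * width
        # Half-open intervals so that no object is selected twice.
        yield f"{BASE_QUERY} AND ra >= {lo:g} AND ra < {hi:g}"


queries = list(split_by_ra(4))
for q in queries:
    print(q)
```

The results of the individual queries can then be concatenated into a single table.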

[4]:

df = vaex.open('/data/sdss/sdss-clean-stars-dered.hdf5')

print(f'number of rows: {df.shape[0]:,}')
print(f'number of columns: {df.shape[1]}')

df.healpix_plot(df.healpix9, show=True, f="log1p", healpix_max_level=9, healpix_level=9,
                healpix_input='galactic', healpix_output='galactic', rotation=(0,45)
               )

number of rows: 132,447,497
number of columns: 21
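For a sense of scale, a HEALPix grid at level 9 has nside = 2^9 = 512 and 12 * nside^2 pixels. The short sketch below computes that count and the average number of rows per pixel if the rows were spread over the whole sphere. Since SDSS covers only part of the sky, real pixels are either empty or denser than this average.

```python
def healpix_npix(level: int) -> int:
    """Number of HEALPix pixels at a given level: 12 * nside**2 with nside = 2**level."""
    nside = 2 ** level
    return 12 * nside ** 2


npix = healpix_npix(9)
rows = 132_447_497  # row count printed above

print(npix)                   # 3145728
print(round(rows / npix, 1))  # 42.1
```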

Helmi & de Zeeuw 2000

Result of an N-body simulation of the accretion of 33 satellite galaxies into a Milky Way dark matter halo.

3.3 million rows - 252MB

[5]:

df = vaex.datasets.helmi_de_zeeuw.fetch() # this will download it on the fly

print(f'number of rows: {df.shape[0]:,}')
print(f'number of columns: {df.shape[1]}')

df.plot([["x", "y"], ["Lz", "E"]], f="log", figsize=(12,5), show=True, limits='99.99%');

number of rows: 3,300,000
number of columns: 11
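The `limits='99.99%'` argument asks for axis limits that enclose that percentage of the data, which keeps a few extreme values from stretching the plot. The sketch below illustrates the idea with a simple sorted-list approach; it is not vaex's algorithm, which may compute the limits differently (for example approximately), and the function name is hypothetical.

```python
def central_limits(values, percentage):
    """Return (low, high) enclosing roughly the central `percentage` of the values."""
    ordered = sorted(values)
    n = len(ordered)
    tail = (1 - percentage / 100) / 2  # fraction trimmed from each side
    lo_index = int(tail * (n - 1))
    hi_index = (n - 1) - lo_index
    return ordered[lo_index], ordered[hi_index]


# 1000 ordinary values plus one extreme outlier
data = list(range(1000)) + [10**6]

print(central_limits(data, 99.0))  # (5, 995)
print(central_limits(data, 100))   # (0, 1000000)
```

With the full range, the single outlier dominates the axis; trimming a small fraction from each side restores a useful view.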
