python - pandas MemoryError beyond a certain skiprows value
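I have a ~1.81GB CSV file with roughly 49 million rows. It has a single column containing 38-character strings.
I am reading this file with read_csv on a Digital Ocean VPS (Ubuntu 12.04.4, Python 2.7, pandas 0.18.0, 512MB RAM). I read 5000 rows at a time. However, it starts raising an error at skiprows = 2800000. Here is the code, tested on a freshly rebooted machine with a freshly started Python: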
>>> pd.read_csv(filename, skiprows=2800000, nrows=5000, header=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ayhan/.conda/envs/swrm/lib/python2.7/site-packages/pandas/io/parsers.py", line 529, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/ayhan/.conda/envs/swrm/lib/python2.7/site-packages/pandas/io/parsers.py", line 295, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/ayhan/.conda/envs/swrm/lib/python2.7/site-packages/pandas/io/parsers.py", line 608, in __init__
    self.options, self.engine = self._clean_options(options, engine)
  File "/home/ayhan/.conda/envs/swrm/lib/python2.7/site-packages/pandas/io/parsers.py", line 731, in _clean_options
    skiprows = set() if skiprows is None else set(skiprows)
MemoryError
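If I run it with skiprows = 1000000, it works fine. If I then try skiprows = 1500000, it raises the error again, which is odd because the error first appeared only after reaching 2800000. Before that, it went through every multiple of 5000 without a problem. Any idea why this happens?
The code works fine on my personal computer: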
df = pd.read_csv(filename, skiprows=2800000, nrows=5000, header=None)
df.memory_usage()
Out[25]:
Index 72
0 40000
dtype: int64
Edit:
The original loop looks like this:
current_chunk = 560
chnksize = 5000
for chunk in range(current_chunk, 1000):
    df = pd.read_csv(filename, skiprows=chnksize*chunk, nrows=chnksize, header=None)
    out = "chunk_" + format(chunk, "06d")
    short_ids = df[0].str.slice(-11)
It queries an API with the short_ids and appends the results to a file. But the snippet I gave at the top raises the error on its own.
Solution:
Pandas implements skiprows in a confusingly memory-intensive way. In pandas.io.parsers.TextFileReader._clean_options:
if com.is_integer(skiprows):
    skiprows = lrange(skiprows)
skiprows = set() if skiprows is None else set(skiprows)
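lrange(n) simply does list(range(n)), so this is effectively skiprows = set(list(range(skiprows))). On Python 2, where range already returns a list, that builds two huge lists and a set, each holding 2.8 million integers! I guess they never expected anyone to skip that many rows.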
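To get a feel for the cost of that set(list(range(n))) pattern, here is a rough sketch that measures the containers with sys.getsizeof. The function name skip_set_footprint is made up for illustration, the demo uses a smaller n than the question so it stays light, and the numbers come from whatever CPython runs it, not from the Python 2.7 interpreter in the question.

```python
import sys


def skip_set_footprint(n):
    """Approximate bytes held by list(range(n)) and a set built from it.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    as_list = list(range(n))
    as_set = set(as_list)
    return {
        "list_bytes": sys.getsizeof(as_list),  # the list's pointer array only
        "set_bytes": sys.getsizeof(as_set),    # the set's hash table only
        "int_bytes": sum(sys.getsizeof(i) for i in as_list),  # the int objects
    }


# A smaller n than the 2,800,000 in the question keeps the demo cheap.
footprint = skip_set_footprint(100000)
for name, size in footprint.items():
    print(name, round(size / 1e6, 2), "MB")
```

The three figures grow roughly in proportion to n, so scaling them to 2.8 million rows gives an idea of why a 512MB machine can run out of memory before any CSV data is parsed.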
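If you want to read the file in chunks, calling read_csv repeatedly with a different skiprows value is an inefficient way to do it. You can pass read_csv a chunksize option and then iterate over the returned TextFileReader in chunks: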
In [138]: reader = pd.read_table('tmp.sv', sep='|', chunksize=4)
In [139]: reader
Out[139]: <pandas.io.parsers.TextFileReader at 0x121159a50>
In [140]: for chunk in reader:
   .....:     print(chunk)
   .....:
   Unnamed: 0         0         1         2         3
0           0  0.469112 -0.282863 -1.509059 -1.135632
1           1  1.212112 -0.173215  0.119209 -1.044236
2           2 -0.861849 -2.104569 -0.494929  1.071804
3           3  0.721555 -0.706771 -1.039575  0.271860
   Unnamed: 0         0         1         2         3
0           4 -0.424972  0.567020  0.276232 -1.087401
1           5 -0.673690  0.113648 -1.478427  0.524988
2           6  0.404705  0.577046 -1.715002 -1.039268
3           7 -0.370647 -1.157892 -1.344312  0.844885
   Unnamed: 0         0        1         2         3
0           8  1.075770 -0.10905  1.643563 -1.469388
1           9  0.357021 -0.67460 -1.776904 -0.968914
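Applied to the loop in the question, the chunksize approach might look like the sketch below. The helper name iter_short_ids, the start_chunk parameter, and the tiny stand-in file are assumptions made for illustration; dtype=str is added so the single column is always read as strings.

```python
import os
import tempfile

import pandas as pd


def iter_short_ids(path, chunksize=5000, start_chunk=0):
    """Yield (chunk_name, short_ids) pairs while reading the file in one pass.

    Hypothetical helper; start_chunk mimics resuming at current_chunk.
    """
    reader = pd.read_csv(path, header=None, dtype=str, chunksize=chunksize)
    for chunk_no, df in enumerate(reader):
        if chunk_no < start_chunk:
            continue  # handled in an earlier run: parsed, then discarded
        name = "chunk_" + format(chunk_no, "06d")
        # Same slicing as in the question: keep the last 11 characters.
        yield name, df[0].str.slice(-11).tolist()


# Stand-in data: 12 rows of made-up 38-character strings.
rows = ["A" * 27 + format(i, "011d") for i in range(12)]
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as fh:
    fh.write("\n".join(rows) + "\n")
    demo_path = fh.name

results = list(iter_short_ids(demo_path, chunksize=5, start_chunk=1))
os.remove(demo_path)

for name, short_ids in results:
    print(name, len(short_ids))
```

Skipped chunks are still parsed here, so resuming is not free, but no row-number set is ever built and only one chunk is held in memory at a time.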
Alternatively, pass iterator=True and use get_chunk to fetch chunks of a specified size:
In [141]: reader = pd.read_table('tmp.sv', sep='|', iterator=True)
In [142]: reader.get_chunk(5)
Out[142]:
   Unnamed: 0         0         1         2         3
0           0  0.469112 -0.282863 -1.509059 -1.135632
1           1  1.212112 -0.173215  0.119209 -1.044236
2           2 -0.861849 -2.104569 -0.494929  1.071804
3           3  0.721555 -0.706771 -1.039575  0.271860
4           4 -0.424972  0.567020  0.276232 -1.087401
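Since get_chunk is called by hand, the caller also has to notice when the file is exhausted. A minimal sketch, assuming the reader signals the end of the data by raising StopIteration and using made-up in-memory data in place of the real file:

```python
import io

import pandas as pd

# Seven made-up rows standing in for the real CSV file.
data = io.StringIO("\n".join("row%02d" % i for i in range(7)) + "\n")
reader = pd.read_csv(data, header=None, iterator=True)

sizes = []
while True:
    try:
        chunk = reader.get_chunk(3)  # ask for up to 3 rows
    except StopIteration:
        break  # no rows left
    sizes.append(len(chunk))

print(sizes)
```

The final chunk can be shorter than the requested size, so downstream code should not assume every chunk has exactly the number of rows asked for.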