
My workflow is:

1. read a huge text file as a pandas dataframe,
2. then groupby a specific column value to split the data and store it as a list of dataframes,
3. then pipe those dataframes to multiprocessing Pool.map() to process each one in parallel.

Everything is fine and the program works well on my small test dataset. But when I pipe in my large data (about 14 GB), the memory consumption increases exponentially and then the computer freezes or the job gets killed (on the HPC cluster).

I have added code to clear the memory as soon as a data/variable is no longer useful, and I close the pool as soon as it is done. Still, with a 14 GB input I was only expecting about a 2 * 14 GB memory burden, but a lot more seems to be going on. I also tried tweaking chunksize, maxtasksperchild, etc. (shown in the short sketch at the end of this post), but I see no difference between the test run and the large run.

I think the improvement is needed at the position in the code where I start multiprocessing:

    p = Pool(3)  # number of pool workers to run at once; default is 1
    result = p.map(matrix_to_vcf, list(gen_matrix_df_list.values()))
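
To make the setup concrete, here is a minimal sketch of that workflow (not my exact code): the grouping column name 'CHROM', the tab separator, and the body of matrix_to_vcf are placeholders, and only the read -> groupby -> Pool.map structure plus the cleanup I described matches what I actually do.

    import gc
    import pandas as pd
    from multiprocessing import Pool

    def matrix_to_vcf(df):
        # placeholder: the real function converts one genome-matrix dataframe to VCF output
        return len(df)

    if __name__ == '__main__':
        # 1. read the huge text file as a pandas dataframe (tab-separated assumed)
        gen_matrix_df = pd.read_csv('genome_matrix_final-chr1234-1mb.txt', sep='\t')

        # 2. groupby a specific column ('CHROM' assumed) and store the groups
        gen_matrix_df_list = {chrom: df for chrom, df in gen_matrix_df.groupby('CHROM')}

        # clear the original dataframe as soon as it is no longer useful
        del gen_matrix_df
        gc.collect()

        # 3. pipe each dataframe to the pool to process in parallel
        p = Pool(3)
        result = p.map(matrix_to_vcf, list(gen_matrix_df_list.values()))

        # close the pool as soon as it is done
        p.close()
        p.join()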

Test example: I created a test file ("genome_matrix_final-chr1234-1mb.txt") of up to 250 MB and ran the program. When I check the system monitor I can see that the memory consumption increases by about 6 GB. I am not so clear why so much memory is taken up by a 250 MB file plus some outputs.

Can someone suggest how I can get rid of this problem? I have shared that file via Dropbox if it helps in seeing the real problem.
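
For reference, the chunksize and maxtasksperchild tweaks mentioned above looked roughly like this (matrix_to_vcf and gen_matrix_df_list are as in the sketch earlier, and the values of 1 are only illustrative, not my exact settings):

    from multiprocessing import Pool

    # recycle each worker after every task and hand out one dataframe at a time
    p = Pool(3, maxtasksperchild=1)
    result = p.map(matrix_to_vcf, list(gen_matrix_df_list.values()), chunksize=1)
    p.close()
    p.join()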
