Topic: no output from sort-bed for large files

pachkov

  • Newbie
  • *
  • Posts: 14
no output from sort-bed for large files
« on: January 10, 2014, 07:58:06 AM »
Dear All,

I was trying to use sort-bed to sort our BED files and ran into a strange problem. It works fine for files containing up to 50,000,000 lines, but for 60,000,000 lines it returns no output. There are no error messages or other hints; it just finishes with no output.

This is how I run it:

zcat test.bed.gz | sort-bed --max-mem 1G - &> out.bed

Are there limitations for sort-bed on file size? How can I find out what is wrong?

Best,

Mikhail

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: no output from sort-bed for large files
« Reply #1 on: January 10, 2014, 02:10:51 PM »
Hi Mikhail, thanks for the post.
No, there shouldn't be a problem with this.  What version of BEDOPS are you using?  Can you post the output of sort-bed --version?  We released BEDOPS v2.4 to the community yesterday, and one of the things we improved was sort-bed.  If you're using v2.3 or below, it's possible that you are running into the limit on the maximum number of open file descriptors with --max-mem 1G.  v2.4 resolves these rare issues.  Can you download the latest software, try again, and report back your findings?

Thanks,
Shane

pachkov

  • Newbie
  • *
  • Posts: 14
Re: no output from sort-bed for large files
« Reply #2 on: January 10, 2014, 03:23:17 PM »
Thanks for the prompt reply, Shane!

The version is 2.4.0. I downloaded the 64-bit binaries yesterday.

I should add that I have tried running it with max-mem values up to 8 GB and still get the same thing: empty output.

Best,

Mikhail

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: no output from sort-bed for large files
« Reply #3 on: January 10, 2014, 04:00:01 PM »
OK, well, perhaps there is a new bug.  Is there anywhere you could place this file so that we could download it and investigate?  It's huge, but that would certainly help us figure out the problem.  Is this possible?

Also, if you do not redirect stderr with stdout, I wonder if it will print something to your terminal?  It's doubtful, but might be worth trying since this seems to be an easy problem to repeat.
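
A minimal sketch of that check, keeping stdout and stderr in separate files and comparing input and output line counts. The helper name check_sort and the file names are made up for illustration, and plain sort is used only as a stand-in when sort-bed is not installed, so the sketch runs anywhere:

```shell
#!/bin/sh
# Sketch only: check_sort is a hypothetical helper, not part of BEDOPS.
# Usage: check_sort INPUT.bed.gz OUTPUT.bed STDERR.log
check_sort() {
    input_gz=$1; out=$2; err=$3
    if command -v sort-bed >/dev/null 2>&1; then
        gzip -dc "$input_gz" | sort-bed --max-mem 1G - > "$out" 2> "$err"
    else
        # Stand-in so the sketch runs without BEDOPS installed.
        gzip -dc "$input_gz" | LC_ALL=C sort -k1,1 -k2,2n > "$out" 2> "$err"
    fi
    status=$?   # exit status of the last command in the pipeline (the sorter)
    in_lines=$(( $(gzip -dc "$input_gz" | wc -l) ))
    out_lines=$(( $(wc -l < "$out") ))
    err_bytes=$(( $(wc -c < "$err") ))
    echo "exit=$status input_lines=$in_lines output_lines=$out_lines stderr_bytes=$err_bytes"
    # Succeed only if the sorter exited cleanly and no lines went missing.
    [ "$status" -eq 0 ] && [ "$in_lines" -eq "$out_lines" ]
}

# Tiny demonstration on three made-up BED lines.
demo_dir=$(mktemp -d)
printf 'chr2\t5\t9\nchr1\t30\t40\nchr1\t10\t20\n' | gzip -c > "$demo_dir/test.bed.gz"
check_sort "$demo_dir/test.bed.gz" "$demo_dir/out.bed" "$demo_dir/sort.err" && echo "line counts match"
```

An empty output file together with a non-zero exit status or a non-empty stderr file would at least narrow down where things go wrong.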

Are you running the Linux binaries or the Mac binaries?  I've been able to sort hundreds of millions of rows with the --max-mem flag without issue.  I would definitely like to see your .gz test file.
« Last Edit: January 10, 2014, 06:17:36 PM by sjn »

pachkov

  • Newbie
  • *
  • Posts: 14
Re: no output from sort-bed for large files
« Reply #4 on: January 12, 2014, 05:40:58 AM »
Here is the link:

http://ismara.unibas.ch/ISMARA/sample_data/tmp.bed.gz

I was redirecting stderr and stdout to separate files in order to catch something, but there is nothing. I am not sure if it matters, but I am running the 64-bit binaries on a quite outdated Linux version (CentOS 5, I believe).

Thanks for the help!

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: no output from sort-bed for large files
« Reply #5 on: January 12, 2014, 10:58:17 AM »
OK, downloading now.  I did sort a 400 million line file yesterday with --max-mem 2G set.

It might be a good idea for you to build BEDOPS from source.  Just run g++ --version and make sure you have version 4.7 or later.  You'll also need git.

git clone https://github.com/bedops/bedops.git
cd bedops
make && make install

bin/sort-bed --help

It takes a few minutes to build everything, but it should go smoothly.
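
A rough pre-build check along those lines, as a sketch. The helper name version_ge is made up for illustration; g++ -dumpversion prints just the version number, which is easier to compare than the full --version banner:

```shell
#!/bin/sh
# version_ge is a hypothetical helper: return 0 if version $1 (e.g. "4.8.5" or "11")
# is at least version $2 (e.g. "4.7"). Only major and minor are compared.
version_ge() {
    have_major=${1%%.*}
    have_rest=${1#*.}
    if [ "$have_rest" = "$1" ]; then have_minor=0; else have_minor=${have_rest%%.*}; fi
    want_major=${2%%.*}
    want_rest=${2#*.}
    if [ "$want_rest" = "$2" ]; then want_minor=0; else want_minor=${want_rest%%.*}; fi
    [ "$have_major" -gt "$want_major" ] ||
        { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}

# Report whether the compiler found on the PATH meets the 4.7 minimum.
if command -v g++ >/dev/null 2>&1; then
    found=$(g++ -dumpversion)
    if version_ge "$found" 4.7; then
        echo "g++ $found is new enough"
    else
        echo "g++ $found is older than 4.7"
    fi
else
    echo "g++ not found"
fi

# The clone step also needs git.
command -v git >/dev/null 2>&1 || echo "git not found"
```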
« Last Edit: January 12, 2014, 11:18:28 AM by sjn »

pachkov

  • Newbie
  • *
  • Posts: 14
Re: no output from sort-bed for large files
« Reply #6 on: January 12, 2014, 12:11:26 PM »
I would try to build it, but as I said, our system is quite outdated, and I am not sure our sysadmins will manage to install a newer gcc on it.

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: no output from sort-bed for large files
« Reply #7 on: January 12, 2014, 12:20:18 PM »
Can you give the results of:
uname -a

Perhaps we can install this in a VM and create a build for you if needed.

At 30 kB/s, it's going to take a long time to download your file, so I'll report any problems I see on this probably tomorrow.

pachkov

  • Newbie
  • *
  • Posts: 14
Re: no output from sort-bed for large files
« Reply #8 on: January 12, 2014, 12:33:14 PM »
Linux  2.6.18-92.el5 #1 SMP Tue Jun 10 18:51:06 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: no output from sort-bed for large files
« Reply #9 on: January 12, 2014, 12:44:25 PM »
Well, I just downloaded the pre-built binaries, and 'file bin/sort-bed' shows it was built for Linux kernel 2.6.18, so it might not be helpful to go down that path after all.  After I get your file, I'll play around and report back.

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: no output from sort-bed for large files
« Reply #10 on: January 13, 2014, 01:56:25 AM »
OK, I finally downloaded that file.  I sorted it with no issues (--max-mem 2G).  If you're interested in the result, you can download it from here:
http://www.uwencode.org/proj/sneph/forum/download.html

It's a highly-compressed 'starch' file, which you can use directly with bedops/bedmap/closest-features (you probably don't need to expand it).  Or, you can expand the file through:
unstarch pachkov.sort.starch > pachkov.bed

Can you post how much system memory (RAM) and disk space you have?  For the latter, you can use: 'df -h'

I just want to be sure that you're not running out of disk space or something like that.  With sort-bed's --max-mem option, data are written to temporary files as needed.
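
A small sketch for collecting those numbers in one go. The helper name free_kb is made up for illustration, and the /proc/meminfo part only applies on Linux:

```shell
#!/bin/sh
# free_kb is a hypothetical helper: print the free space, in KiB, on the
# filesystem that holds the directory "$1" (POSIX df output, 4th column).
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Check /tmp and, if set, the directory named by TMPDIR.
for dir in /tmp "${TMPDIR:-/tmp}"; do
    echo "$dir: $(free_kb "$dir") KiB free"
done

# Memory figures on Linux; /proc/meminfo does not exist on other systems.
if [ -r /proc/meminfo ]; then
    grep -E '^(MemTotal|MemFree):' /proc/meminfo || true
fi
```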

« Last Edit: January 13, 2014, 02:11:33 AM by sjn »

pachkov

  • Newbie
  • *
  • Posts: 14
Re: no output from sort-bed for large files
« Reply #11 on: January 13, 2014, 02:59:48 AM »
Thank you! So it is something about our system.

On /tmp we have only 3.7G free, but I was running sort-bed with $TMPDIR set to ~/bedops_tmp, and in our home directory we have 3.3T free. BTW, I was checking $TMPDIR and /tmp but have not seen any temporary files. Are they very short-lived?
We have 12G of RAM, 7G of which is almost always free. I was also monitoring RAM during the runs: sort-bed was using no more than the amount given by --max-mem, and there was still a lot of RAM free.

Best,

Mikhail

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: no output from sort-bed for large files
« Reply #12 on: January 13, 2014, 07:27:45 AM »
We're using the C/C++ tmpfile() function to create the temporary files.  This function has some drawbacks, but it doesn't seem problematic for our needs.  The way it works on Linux makes it tough to track things the way you tried to.

'On some implementations (e.g. Linux), this function actually creates, opens, and immediately deletes the file from the file system: as long as an open file descriptor to a deleted file is held by a program, the file exists, but since it was deleted, its name does not appear in any directory, so that no other process can open it. Once the file descriptor is closed, the space occupied by the file is reclaimed by the filesystem.'

from http://en.cppreference.com/w/cpp/io/c/tmpfile
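
The same create, open, and delete pattern can be imitated from the shell. This is only an illustration of the behaviour quoted above, not what sort-bed itself runs; the directory and file names are made up:

```shell
#!/bin/sh
# Imitate the "create, open, immediately delete" pattern with shell file descriptors.
work=$(mktemp -d)
exec 3> "$work/scratch"   # writer descriptor (creates the file)
exec 4< "$work/scratch"   # independent reader descriptor, offset starts at 0
rm "$work/scratch"        # remove the name; the data lives on while descriptors stay open

printf 'chr1\t10\t20\n' >&3          # write through the already-deleted file
IFS= read -r line <&4                # and read it back through the other descriptor
entries=$(( $(ls -A "$work" | wc -l) ))
echo "directory entries: $entries; read back: $line"

exec 3>&- 4<&-            # closing both descriptors lets the filesystem reclaim the space
rmdir "$work"
```

The directory listing stays empty the whole time, which is why looking in /tmp during a run shows nothing. On Linux, the open descriptors of a running process are still visible under /proc/<pid>/fd, where such files are marked "(deleted)".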

While $TMPDIR works well for scripts, I'm unsure whether it has the same effect for a non-script program like sort-bed, which does not explicitly look for that environment variable.  If you find out, let us know!  I did not envision the type of scenario you're in, where /tmp/ has little space but a network share has plenty.  What you tried to do seems reasonable, and perhaps we can add the environment variable check explicitly in sort-bed and document these types of situations.
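
For comparison, a quick sketch of the script-level case: the mktemp utility reads TMPDIR itself when no template is given, so the variable takes effect there. A compiled program only behaves the same way if it, or the library call it uses, looks the variable up. The directory used here is a throwaway created just for the demonstration:

```shell
#!/bin/sh
# Show that mktemp places its file under the directory named by TMPDIR.
scratch=$(mktemp -d)
made=$(TMPDIR="$scratch" mktemp)
case "$made" in
    "$scratch"/*) echo "mktemp honoured TMPDIR: $made" ;;
    *)            echo "mktemp ignored TMPDIR: $made" ;;
esac
rm -f "$made"
rmdir "$scratch"
```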

Shane
« Last Edit: January 13, 2014, 07:40:36 AM by sjn »

pachkov

  • Newbie
  • *
  • Posts: 14
Re: no output from sort-bed for large files
« Reply #13 on: January 13, 2014, 08:01:26 AM »
I see. From a quick glance on Google, tmpfile() should use $TMPDIR, but I only found that on some forums, not in the docs.
I will try to get it compiled locally and try again.

Anyway, a feature for specifying the temporary directory would be very useful!

Thank you for the help!

Best regards,

Mikhail

pachkov

  • Newbie
  • *
  • Posts: 14
Re: no output from sort-bed for large files
« Reply #14 on: January 13, 2014, 08:47:43 AM »
I managed to compile sort-bed on our machine and checked with strace where the files are stored. Indeed, they go under /tmp no matter what I set $TMPDIR to.
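
A sketch of how such a check can be done, for anyone who wants to repeat it. The helper names trace_files and opened_under, the log name, and the input file name in the usage comment are made up for illustration; strace is Linux-only:

```shell
#!/bin/sh
# trace_files and opened_under are hypothetical helper names.

# Run a command under strace, logging file-related system calls to a log file.
# Usage: trace_files LOGFILE COMMAND [ARGS...]
trace_files() {
    log=$1; shift
    strace -f -e trace=file -o "$log" "$@"
}

# List logged calls whose quoted path argument lies under the given directory.
# Usage: opened_under DIRECTORY LOGFILE
opened_under() {
    grep -F "\"$1/" "$2"
}

# Example (not run here; needs strace and sort-bed on the PATH):
#   TMPDIR=$HOME/bedops_tmp trace_files trace.log sort-bed --max-mem 1G test.bed > out.bed
#   opened_under /tmp trace.log
#   opened_under "$HOME/bedops_tmp" trace.log
```

If the first query lists temporary file paths and the second lists none, the program is writing to /tmp regardless of $TMPDIR.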