Recent Posts

41
closest-features / Re: BEDOPS Closest feature help
« Last post by AlexReynolds on November 18, 2015, 03:52:42 PM »
Please see my response here: https://www.biostars.org/p/166413/#166415

Basically, your input files need just a little cleaning before they can be used as BED files.
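
As a generic sketch only (the filenames and column positions here are assumptions, not the specifics from the linked answer), that kind of cleanup often amounts to stripping the header row, keeping tab-delimited chromosome/start/end columns, and sorting:

$ tail -n +2 regions.txt | cut -f1-3 | sort-bed - > regions.sorted.bed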
42
closest-features / BEDOPS Closest feature help
« Last post by shewalesj on November 18, 2015, 02:46:21 PM »
Hello Everyone,

I've got an output file from RnBeads, an R package, with differentially methylated regions. None of these regions come with gene IDs. I'm trying to use BEDOPS closest-features to get the closest gene ID for each region in my dataset (an Excel file), but I've been unable to do this successfully. Every time I run the closest-features command, the output file suggests that none of my locations fell within the reference file; I'm assuming this because, in the last column, every row ends with "|NA".

My question is: what am I doing wrong? Are my files in the wrong format? Did I download the correct file from Ensembl? Why are none of my regions of interest returning a closest region?


I've attached the following files.
1. tiling_fourm.xlsx <-- This Excel file contains the regions I'm trying to find the closest genes for.
2. tiling_forum.bed <-- The same data with the first row (headings) removed. I took the chromosome, start, and end columns, saved them as a tab-delimited text file, and changed the extension to .bed.
3. hg19.bed <-- This is the reference file (analogous to the <input-file> on the BEDOPS closest-features user guide page). I created it at https://genome.ucsc.edu/cgi-bin/hgTables by selecting the following: clade: Human, genome: Human, assembly: Feb. 2009 (GRCh37/hg19), group: Mapping and Sequencing, track: UCSC Genes, table: knownGene, region: genome, output format: selected fields from primary and related tables. The fields selected were chrom, chromStart, and chromEnd.
4. outputforum.txt <-- This is the result file I get after running the following command: closest-features --closest hg19.bed tiling_forum.bed > outputforum.txt

As you can see, every row of the file that's produced ends in "|NA" in the last column, which I assume means "not mapped"?

If anyone could help me out, it would be greatly appreciated.

thanks,
43
all / Re: finding overlapping regions of ChIP-seq BEDs
« Last post by AlexReynolds on November 03, 2015, 01:11:39 PM »
You could perhaps take the multiset union of your ChIP-seq datasets:

$ bedops -u chip1.bed chip2.bed ... chipN.bed > chips.bed

Then use chips.bed as the map file in bedmap, applying the --echo-map-score operator to retrieve score (value) data from the unioned dataset:

$ bedmap --echo --echo-map-score --delim '\t' reference_regions.bed chips.bed > reference_regions_with_scores_from_overlapping_chip_peaks.bed

If this does what you want, then you can set up a pipeline to do both steps in one line of code:

$ bedops -u chip1.bed chip2.bed ... chipN.bed | bedmap --echo --echo-map-score --delim '\t' reference_regions.bed - > reference_regions_with_scores_from_overlapping_chip_peaks.bed

This would run faster, since you wouldn't create the intermediate file chips.bed.
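
As a made-up illustration of the output format, each line would be the reference element followed by the scores of its overlapping peaks, joined with bedmap's default ';' delimiter (the coordinates and scores below are invented):

chr1	10500	11000	5.7;12.3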

I may not fully understand your question, but hopefully this suggests options. Please let us know if this does not help.
44
all / finding overlapping regions of ChIP-seq BEDs
« Last post by cavoli on November 03, 2015, 12:48:39 PM »
Dear All hi,

I am having problems finding the overlapping regions of ChIP-seq data. With bedops I can find the overlapping regions of multiple ChIP-seq datasets, but I cannot bring the peak heights of my data into bedops. Those values are as important to me as the genomic locations.

So my data consists of:

chr start end peakname value
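
e.g., a typical line (values invented purely for illustration):

chr1	1000	2000	peak_1	37.5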

Is there a way to incorporate that value?

I thought I could use bedmap for this purpose, but I couldn't apply it to multiple files.

I would be more than glad if you could help me.

Thank you for your help.

Tunc.
45
all / Re: Issue with convert2bed
« Last post by AlexReynolds on October 16, 2015, 04:36:07 PM »
Your environment's PATH variable setting helps your shell find the applications you want to run.

So also make sure all your BEDOPS binaries are available from your environment PATH.

Usually, this means editing .bash_profile (or similar, depending on your shell) to include the directory containing convert2bed, sort-bed, etc. in your PATH, or else copying the BEDOPS binaries to /usr/local/bin or a similar folder that is already in your PATH.
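
For example, you might append a line like the following to .bash_profile; the directory shown is just a placeholder for wherever your BEDOPS binaries actually live:

export PATH="$PATH:/path/to/bedops/bin"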

If you're not sure what your PATH variable is set to, type echo $PATH and see what it says.
46
all / Re: Issue with convert2bed
« Last post by sjn on October 16, 2015, 06:26:24 AM »
Try:

convert2bed --input wig --output bed < GSM980987_WT_rep3_CG.wig > GSM980987_WT_rep3_CG.bed

There is a script that wraps convert2bed specifically for this use (and makes the syntax easier):

wig2bed < GSM980987_WT_rep3_CG.wig > GSM980987_WT_rep3_CG.bed

47
all / Issue with convert2bed
« Last post by catcha on October 15, 2015, 10:33:58 PM »
I have been trying to use convert2bed to convert WIG files from UCSC to BED. However, I keep getting the same message:

Code:
$ ./convert2bed --input=wig [--output=bed] < GSM980987_WT_rep3_CG.wig > GSM980987_WT_rep3_CG.bed
Error: Cannot find sort-bed binary required for sorting BED output
convert2bed
  version: 2.4.13
  author:  Alex Reynolds

I've looked for the sort-bed binary and it seems to be fine:
Code:
$ ./sort-bed
sort-bed
  citation: http://bioinformatics.oxfordjournals.org/content/28/14/1919.abstract
  version:  2.4.14
  authors:  Scott Kuehn

USAGE: sort-bed [--help] [--version] [--check-sort] [--max-mem <val>] [--tmpdir <path>] <file1.bed> <file2.bed> <...>
        Sort BED file(s).
        May use '-' to indicate stdin.
        Results are sent to stdout.

        <val> for --max-mem may be 8G, 8000M, or 8000000000 to specify 8 GB of memory.
        --tmpdir is useful only with --max-mem.

I'm not sure what the issue is. Could it be my input WIG file? Its current format is:
Code:
$ head GSM980987_WT_rep3_CG.wig
track type=wiggle_0 name="WT rep3 mCG" description="WT replicate 3 mCG Stroud et al" visibility=full color=0,100,0 graphType=bar
variableStep chrom=chr1
109     1
110     -0.95
115     0.833333
116     -1
161     0.933333
162     -0.8
310     0.777778
311     -0.833333

Any help would be greatly appreciated.
48
sort-bed / Re: combine sort multiple compressed files?
« Last post by sjn on October 02, 2015, 04:51:23 PM »
The trade-off between speed and size is pretty straightforward within BEDOPS. If query speed and/or file-merging speed is the main goal, then using raw BED files is the best approach. If saving space is the main goal, then Starch is the best route.

BED:
When looking up some relatively small number of regions (say 30,000 or fewer), the fastest query speed by far happens with bedextract, and there is a way to use this tool even when there are fully-nested elements (see the note under "Retrieving elements which overlap target elements" at https://bedops.readthedocs.org/en/latest/content/reference/set-operations/bedextract.html#retrieving-elements-which-overlap-target-elements).
> bedextract a.bed b.bed

Just using bedops -e performs well too, and it's useful to use the --chrom flag when you can in order to speed up the retrieval of elements from a specific chromosome:
> bedops --chrom chr7 -e 10 a.bed b.bed

The bedops program (for the -e operation) actually merges elements in b.bed above.  You could use bedmap instead if this is an issue, though it will be a bit slower:
> bedmap --echo --chrom chr7 --bp-ovr 10 --skip-unmapped a.bed b.bed

(--faster makes bedmap much faster when it's applicable, but it is not applicable if you have fully-nested elements).

When unionizing BED files, it's best to sort things up front one time and then use bedops -u each time you get new files to merge:
> bedops -u current-repository.bed new-file1.sorted.bed new-file2.sorted.bed ... > new-repository.bed

Using sort-bed does work for unionizing information across files, but it will not scale as well as bedops -u.
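
For comparison, the sort-bed version of that update would re-sort everything from scratch each time (same hypothetical filenames as above):

> sort-bed current-repository.bed new-file1.bed new-file2.bed ... > new-repository.bed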

Starch:
The bedextract option is not applicable here.

The bedops and bedmap options still apply exactly as above, and you may continue to use the --chrom operator to help speed things up. The files are compressed, so there is a bit more work, and these options are necessarily slower here than when used with raw BED files.
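
For instance, the earlier per-chromosome lookup works the same way on compressed archives (filenames assumed):

> bedops --chrom chr7 -e 10 a.starch b.starch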

For merging information, Alex already showed the main applications of starchcat. You may also use bedops -u with Starch file inputs. starchcat is much faster here if all inputs contain information from separate chromosomes, as he described. This is probably most applicable when you are gluing together results after operating per chromosome on a cluster or across multiple processors.

I don't use SAMtools, so I can't speak to relative speeds. If speed is your main interest, then I recommend the bedextract trick for queries, along with bedops -u when merging new and old files. If space savings is the main goal, Starch is a no-brainer, and query speeds are still pretty decent with the --chrom option in bedops/bedmap.
49
sort-bed / Re: combine sort multiple compressed files?
« Last post by parlar on October 02, 2015, 12:05:26 AM »
Brilliant, thank you for your elaborate answer!

Just one more question: maybe I won't need to convert from Starch to bgzip plus a tabix index after all?
Since I will have nested features, I will need to use bedops --element-of for extraction. Do you know how fast this is for (compressed) BED files with millions of rows, compared to using tabix chr:xxxx-yyyy? If the speed is comparable, there is really no need to convert to a bgzip/tabix index. I could not find any benchmarks on this.
50
sort-bed / Re: combine sort multiple compressed files?
« Last post by AlexReynolds on October 01, 2015, 11:00:26 AM »
If you use the Starch compression format (http://bedops.readthedocs.org/en/latest/content/reference/file-management/compression/starch.html) then you can use starchcat to merge Starch files. I explain below why that might be useful for you.

First, for an overview about starchcat, here's a link to documentation for starchcat: http://bedops.readthedocs.org/en/latest/content/reference/file-management/compression/starchcat.html

Starchcat operates most efficiently when records are merged from starch files for separate chromosomes.

As an example, let's say we start with 10 Starch files, one for each of chromosomes 1 through 10. In other words, each Starch file contains records for only that chromosome: file 1A has records only for chr1, file 2A only for chr2, file 3A only for chr3, and so on. The starch and unstarch tools compress and extract BED data:

$ starch 1A.bed > 1A.starch
$ unstarch 1A.starch > 1A.bed

You want to merge these ten Starch files into one Starch file, to make a final product. You can use starchcat to do this:

$ starchcat 1A.starch 2A.starch ... 10A.starch > 1A_through_10A.starch

Because each file contains records specific to a chromosome, starchcat knows that it can merge the compressed data as-is, without any costly extraction and re-compression. It just copies over the compressed chromosomes to the output.

That's the simple case. Now let's say you do some sequencing work and only need to update data on chromosome 4, and you have a new Starch file called 4B.starch. So you use starchcat again:
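
A sketch of that second call, assuming the per-chromosome archives are still around (the output filename is made up; note that both 4A.starch and 4B.starch are passed, so their chromosome-4 records get merged):

$ starchcat 1A.starch 2A.starch 3A.starch 4A.starch 4B.starch 5A.starch ... 10A.starch > updated_product.starch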

This time, the operation will copy compressed data as-is -- just the raw bytes -- from the unchanged chromosomes 1-3 and 5-10. On chromosome 4, however, the Starch files 4A and 4B are extracted, a merge sort is performed on their BED elements, and the merged data are written out as a newly-compressed chromosome 4. The only extra work done is on chromosome 4.

By segregating work by chromosome, and using starchcat to recompress only the records from an updated chromosome, you can make and manage an updated final product with much less work (time) than recompressing the entire BED file.

Also, if you have a computational cluster (Grid Engine, AWS instances, etc.), you can easily parallelize the work by segregating it by chromosome.

You might farm out 10 compression tasks to 10 computational nodes, each node working on compressing records for one chromosome. When all 10 tasks are complete, you trigger a final collation task with starchcat to merge the 10 starch files into one final product. If you have disk space, you could trigger another task after that, to make a separate tabix file for convenience.
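
As a rough sketch of that idea on a single multi-core machine (the filenames here are invented, and a real cluster submission would use your scheduler's syntax rather than shell background jobs):

# compress each chromosome in parallel, then collate the results
for chr in $(bedextract --list-chr all.sorted.bed); do
    bedextract "$chr" all.sorted.bed | starch - > "$chr".starch &
done
wait
starchcat chr*.starch > final.starch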

As a format, Starch is also generally more efficient than gzip or bzip2: it preprocesses BED data to reduce redundancy before a second compression step. Given 28 GB of raw bedGraph, for example, bzip2 reduces this to about 24% of the original file size, while Starch gets it to about 5%.

As a second alternative to sort-bed, you could also use "bedops -u" (or "bedops --everything") on as many BED files as you specify, which I believe internally does a merge sort across all the specified files:

$ bedops -u A.bed B.bed C.bed ... N.bed > A_through_N.bed

I think this will run a little slower than sort-bed when sort-bed operates within system memory. However, sort-bed has a mode where, if you specify a memory allocation, it will use a merge sort on temporary files; this runs slower but allows sorting of BED files larger than system memory:

$ sort-bed --max-mem 2G A.bed B.bed C.bed ... N.bed > A_through_N.bed

So, to recap, here are some approaches you could explore:

- starchcat on multiple Starch archives, no parallelization (copy of unchanged chromosomes, merge sort on changed chromosomes)
- starchcat on multiple Starch archives, with parallelization
- bedops -u/--everything on multiple BED files (merge sort)
- sort-bed on multiple BED files (quicksort, within system memory)
- sort-bed --max-mem on multiple BED files (merge sort)

Hopefully this helps. If you try out Starch and like it, let us know!