The only restrictions on input to the starch utility are that the file or input stream:
- must have at least three tab-delimited columns, the first three of which represent the chromosome label, start position and stop position
- must use unique identifiers for the first column (chromosome) that divide the input into subsets
- must use positive integers for the second and third columns (start and stop positions)
- must be sorted as the bed-sort or bbms utilities would sort it (i.e., with a lexicographical sort on the first column)
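
The restrictions above can be checked before archiving with a short script. What follows is only a minimal sketch in Python, not part of the starch toolkit: the function name `check_starch_input` is hypothetical, it treats any string of ASCII digits as an acceptable coordinate, and it assumes that Python's default string comparison is a good enough stand-in for the lexicographical order on the first column. It does not check ordering of the start and stop columns.

```python
def check_starch_input(lines):
    """Return a list of (line_number, message) problems; an empty list means no problems found.

    Hypothetical helper for illustration only; it is not part of starch.
    """
    problems = []
    prev_chrom = None  # chromosome label seen on the previous valid line
    for lineno, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        # Restriction: at least three tab-delimited columns.
        if len(fields) < 3:
            problems.append((lineno, "fewer than three tab-delimited columns"))
            continue
        chrom, start, stop = fields[:3]
        # Restriction: start and stop must be integers (ASCII digits only here).
        if not all(v.isascii() and v.isdigit() for v in (start, stop)):
            problems.append((lineno, "start/stop positions are not integers"))
            continue
        # Restriction: lexicographical sort on the first column.
        # Assumption: plain string comparison approximates the required order.
        if prev_chrom is not None and chrom < prev_chrom:
            problems.append((lineno, "chromosome labels are not in lexicographical order"))
        prev_chrom = chrom
    return problems
```

Any iterable of lines can be passed in, for example an open file handle or `sys.stdin`. Because each chromosome label must form one contiguous subset, a label that reappears after a different label will also show up here as an ordering problem.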
Some usage tips:
- up to 1 MB of data can be stored per row, spread across one or more additional columns (although adding more columns reduces the significant compression gains that come from the internal transformation of the BED coordinate data)
- there is no error checking on the extra columns, and you can put pretty much anything you want in them (except a newline character)
- the naming scheme for the chromosome labels can be anything, from chrN to fooBarBazN
- stripping non-essential columns from BED data (e.g., zeroed score or placeholder ID columns) is a useful way to gain additional storage efficiencies
- the unstarch utility can output the archive's metadata in tabular or JSON format, JSON being potentially useful for web service pipelines
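
As an illustration of the column-stripping tip, the sketch below keeps only the leading columns of each row before the data is handed to starch. It is a minimal example under stated assumptions, not part of the starch toolkit: the function name `keep_leading_columns` and its `n_columns` parameter are hypothetical, and the input is assumed to be tab-delimited with one element per line.

```python
def keep_leading_columns(lines, n_columns=3):
    """Yield each row truncated to its first n_columns tab-delimited fields.

    Hypothetical helper for illustration only; it is not part of starch.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Slicing is safe even when a row has fewer than n_columns fields.
        yield "\t".join(fields[:n_columns]) + "\n"
```

For example, `sys.stdout.writelines(keep_leading_columns(sys.stdin))` would reduce a stream to its chromosome, start and stop columns, dropping any placeholder ID or zeroed score columns that follow.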