Replies: 3 comments 10 replies
-
What do you think the performance hit for Python would be? |
Beta Was this translation helpful? Give feedback.
-
I think performance isn't the top priority, but rather flexibility as that is what bedtools lacks. This comes from the Rust and Python API for new scripts, and I would argue to use Python for filtering on the CLI, simply because it is intuitive for the target community. Another thing I have been thinking about is how stupidly complex the code in bedtools is for handling the numerous output structures depending on whether it is run without options (just print the intersecting sub-interval), using It seems that it might be easier if the output is always a the result of a left outer join (A ivls with no intersections receive NULL defaults for B ivls), and the options add relevant output columns to the output, depending on the options. An exception that ruins this concept is Here is an example
|
Beta Was this translation helpful? Give feedback.
-
@arq5x , I quite like your setup of the problem. Consider using string formatting (as in luau or python f-strings) to enable:
and for The final example from Aaron above, which was:
We can have:
We can have a default so that always "a.chrom\ta.start\ta.end" is already written, in which case, the and --format (equivalent to --add above):
Note that the user could also have a single --format with Then if we want to customize the first 3 columns:
the default which outputs chrom, start, end, could be modified with flags I will proceed with this. We will hit limitations or complications when we need to access, for example the INFO field of a VCF. It will need to look something like: def my_custom_function() {
columns = []
for iv in intersections.overlapping {
# now iv is an enum in rust
# we need the concrete type in python
if iv.type() == Type.VCF {
variant = iv.variant()
dp = variant.info.get("DP").integer() # again, we convert from enum to python
columns.append(str(dp))
} else if iv.type() == Type.BED {
# do something
} else {
# do something else
}
}
return "\t".join(columns)
} Presumably, that will be in a script
|
Beta Was this translation helpful? Give feedback.
-
With the current abstractions and API, it's trivial to get a command-line program going that can detect and handle any file type (BAM/CRAM/BED/VCF/etc) and perform intersections:
bedder-rs/src/main.rs
Lines 42 to 45 in fe37df3
Users of bedder-rs as a library can then quite simply handle the intersection structs:
bedder-rs/src/intersection.rs
Lines 38 to 51 in fe37df3
(where
Position
is an enum of BAM/CRAM/BED/etc).But how can we make this generally available from the command-line without implementing all of bedtools?
It might be possible to have a very flexible way to do this by allowing user expressions specified from the command-line. Due to the libraries and speeds available, I expect this will be in lua (see https://github.com/brentp/pbr for an example of allowing user expressions), but we can also make it in python with a performance hit.
What will this look like from the **command-line? Here are some on-the-fly musings that I'll revisit and perhaps completely change. Insight/ideas are welcome at any time.
Count intersections
Note that lua syntax for length of an array is actually
#intersections.overlapping
(!?) but we'll pretend for now it can use.len()
.Filter to intersecting
Add column to VCF
already, this gets quite complicated. But adding a field to a VCF should be one of the more onerous tasks.
With this, perhaps it's good to start by writing
union()
andcoverage()
methods in rust since these would be needed and will require more details.intersect and chop -a intervals to bases covered by -b
This is a common bedtools operation:
Now we already have:
Beta Was this translation helpful? Give feedback.
All reactions