I am generating random DNA sequences in R where each sequence is of a set length and contains a user-specified distribution of nucleotides.
What I want to be able to do is ensure certain runs of nucleotides are NOT generated in a given sequence. The runs that are disallowed are: "aga", "agg", "taa", "tag" and "tga".
Here is my code that simply generates sequences where the above runs MAY occur. I am unsure how best to modify the code to account for the "tabu" runs specified above.
library(ape)
length.seqs <- 100 # length of DNA sequence
nucl.freqs <- rep(1/4, 4) # nucleotide frequencies
# DNA alphabet
nucl <- as.DNAbin(c('a', 'c', 'g', 't')) # A, C, G, T
# Randomly sample nucleotides
seqs <- sample(nucl, size = length.seqs, replace = TRUE, prob = nucl.freqs)
I am thinking to simply list all the allowed runs which would be used in place of 'nucl' and specify 'size' = length.seqs / 3 within the sample() function, but this seems cumbersome, even with shortcuts like 'expand.grid()'.
Aucun commentaire:
Enregistrer un commentaire