jeudi 23 avril 2020

Generating DNA codon combinations in R

I am generating random DNA sequences in R where each sequence is of a set length and contains a user-specified distribution of nucleotides.

What I want to be able to do is ensure certain runs of nucleotides are NOT generated in a given sequence. The runs that are disallowed are: "aga", "agg", "taa", "tag" and "tga".

Here is my code that simply generates sequences where the above runs MAY occur. I am unsure how best to modify the code to account for the "tabu" runs specified above.

library(ape)

length.seqs <- 100 # length of DNA sequence
nucl.freqs <- rep(1/4, 4) # nucleotide frequencies

# DNA alphabet
nucl <- as.DNAbin(c('a', 'c', 'g', 't')) # A, C, G, T

# Randomly sample nucleotides
seqs <- sample(nucl, size = length.seqs, replace = TRUE, prob = nucl.freqs) 

I am thinking to simply list all the allowed runs which would be used in place of 'nucl' and specify 'size' = length.seqs / 3 within the sample() function, but this seems cumbersome, even with shortcuts like 'expand.grid()'.




Aucun commentaire:

Enregistrer un commentaire