Counting instances of genotype strings where order is irrelevant within locus

Question

I'm looking at genetics punnet squares for two homologous chromosome pairs, which I've modeled in R like this:

# Punnet square for two homologous chromosome pairs
col_names<-c("AB", "Ab", "aB", "ab")
row_names<-c("AB", "Ab", "aB", "ab")
offspring<-as.data.frame(matrix(ncol=length(col_names), nrow=length(row_names),
                                dimnames=list(row_names, col_names)))

for(i in 1:length(row_names)){
  for(j in 1: length(col_names)){
    offspring[i,j]<-paste0(substr(rownames(offspring)[i],1,1), substr(colnames(offspring)[j],1,1), paste0(substr(rownames(offspring)[i],2,2), substr(colnames(offspring)[j],2,2)))
  }
}

> print(offspring)
     AB   Ab   aB   ab
AB AABB AABb AaBB AaBb
Ab AAbB AAbb AabB Aabb
aB aABB aABb aaBB aaBb
ab aAbB aAbb aabB aabb

I now want a frequency count for unique values within this table, however unique in this instances doesn't matter what order the characters in the string are in (Aa is equivalent to aA, for example, but AA is different to aa). Is there a simple way to do this? The result should be like this:

AABB: 1 AABb: 2 AAbb: 1

AaBB: 2 AaBb: 4 Aabb: 2

aaBB: 1 aaBb: 2 aabb: 1

Rui Barradas · Accepted Answer · 2026-03-04 10:33:14Z

6

For each string, sort its characters, put them back together and count the frequencies with table.

fun <- function(x) {
  lapply(x, strsplit, "") |> 
    unlist(recursive = FALSE, use.names = FALSE) |>
    lapply(sort) |>
    sapply(paste, collapse = "") |>
    table(useNA = "ifany")
}

fun(offspring)
#> 
#> aabb aabB aaBB aAbb aAbB aABB AAbb AAbB AABB 
#>    1    2    1    2    4    2    1    2    1

^{Created on 2026-03-03 with reprex v2.1.1}

You can output as a data.frame by piping the above to as.data.frame.

fun(offspring) |> as.data.frame()
#>   Var1 Freq
#> 1 aabb    1
#> 2 aabB    2
#> 3 aaBB    1
#> 4 aAbb    2
#> 5 aAbB    4
#> 6 aABB    2
#> 7 AAbb    1
#> 8 AAbB    2
#> 9 AABB    1

^{Created on 2026-03-03 with reprex v2.1.1}

edited 18 hours ago

answered yesterday

Rui Barradas

78.3k8 gold badges41 silver badges75 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

Phenomniverse yesterday

I've marked this as the correct answer because a) it works, and b) it is the easiest for me to understand. Probably I would have worked out a solution something like this myself eventually. But the other answers are also good. @jay.sf's generalizable solution for multiple loci is potentially very useful, but would need to confirm using all 64 allele combinations of a trihybrid cross rather than the 16 randomly generated ones demonstrated.

Rui Barradas yesterday

Thanks! If you want to generalize the problem the recommended way is to ask another question with a link to this one.

jay.sf 23 hours ago

@Phenomniverse Please see the edit to my answer. I added a mk_punnet() example generating the full 8×8 trihybrid Punnett grid (64 offspring combinations), which confirms the function returns the expected collapsed counts.

Friede 19 hours ago

At least you can collapse sort and paste into a single loop to reduce redundancy in run time.

Rui Barradas 18 hours ago

Thanks, I have just tested it and the original is, surprisingly, faster.

jblood94 · Accepted Answer · 2026-03-03 15:26:14Z

A quick one-liner using Tmisc::strSort (same basic approach of sorting the strings before using table):

library(Tmisc)
table(strSort(unlist(offspring, 0, 0)))
#> 
#> aabb aabB aaBB aAbb aAbB aABB AAbb AAbB AABB 
#>    1    2    1    2    4    2    1    2    1

A faster option if performance is a concern. This uses the non-sorting unordered match approach found in this answer.

strtable <- function(x) {
  x <- unlist(x, 0, 0)
  y <- mapply(utf8ToInt, x, USE.NAMES = FALSE)
  u <- unique(`dim<-`(y, NULL))
  y[] <- match(y, u)
  y <- colSums(y*sin(y*pi/(length(u) + 1)))
  i <- which(!duplicated(y))
  `names<-`(tabulate(match(y, y[i])), x[i])
}

strtable(offspring)
#> AABB AAbB aABB aAbB AAbb aAbb aaBB aabB aabb 
#>    1    2    2    4    1    2    1    2    1

Benchmarking against the functions from @RuiBarradas and @ThomasIsCoding (the function provided by @jay.sf does not return the correct counts for the general problem of counting occurrences of unordered strings--it assumes the OP's specific problem of "diploid, biallelic loci" as noted in jay.sf's answer).

microbenchmark::microbenchmark(
  strtable(str),
  table(strSort(unlist(str, 0, 0))),
  Rui_Barradas(str),
  tic(str)
)
#> Unit: milliseconds
#>                               expr      min        lq      mean   median       uq      max neval
#>                      strtable(str)   8.3294  10.06245  12.71034  12.6583  14.5268  22.7197   100
#>  table(strSort(unlist(str, 0, 0))) 169.1872 215.74645 236.81358 242.0942 260.7001 325.7841   100
#>                  Rui_Barradas(str) 172.1685 215.09295 252.07995 252.9611 283.8306 333.5353   100
#>                           tic(str) 149.1786 189.83450 210.64926 213.6807 228.9248 318.0333   100

Benchmark data:

str <- stringi::stri_rand_strings(1e4, 4, "[A-Ea-e]")

# partial check for correctness
res <- vector("list", 4)
res[[1]] <- sort(as.integer(table(strSort(unlist(str, 0, 0)))))
res[[2]] <- sort(as.integer(strtable(str)))
res[[3]] <- sort(as.integer(Rui_Barradas(str)))
res[[4]] <- sort(as.integer(tic(str)))
identical(res[-1], res[-4])
#> [1] TRUE

Functions:

Rui_Barradas <- function(x) {
  lapply(x, strsplit, "") |> 
    unlist(recursive = FALSE) |>
    lapply(sort) |> 
    sapply(paste, collapse = "") |>
    table(useNA = "ifany")
}

tic <- function(x) { # ThomisIsCoding
  table(sapply(unlist(x), \(x) intToUtf8(utf8ToInt(x)[order(utf8ToInt(x))])))
}

Could you please specify where my function is not returning the correct counts?
Sure. See my benchmark dataset (str <- stringi::stri_rand_strings(1e4, 4, "[A-Ea-e]")). It gives a much higher count of unique unordered strings than the other answers. If I use your mk_mat function to produce the test data, it matches. It may be that your function is assuming something about the structure of the strings that matches the OP's example data. Your function is the second fastest if I use str <- as.data.frame(mk_mat(200, 6)) as the benchmark dataset (just over twice the speed of strtable).
My function is correct for the problem as posed. The OP is working with structured genotype strings (e.g., XxYYzz), not arbitrary random strings. If you benchmark on unrestricted strings (e.g., aHTfsD), you’re solving a different problem, so differing counts reflect a different interpretation—not an error in my function.
Agreed. The other answers and mine match the title of the question, which is more general than the specific question posed by the OP. Your answer is good, and I see that you added a note clarifying that point.
I’ve updated the title to reflect the genotype structure implied by the example and expected output.
I don't think that is a good idea. I strongly suggest reverting the change. Most of the answers here are for the more general question posed by the original title, and the title change may throw off searches.

jay.sf · Accepted Answer · 2026-03-04 05:31:02Z

To treat Xx and xX as the same we might use a strsplit/sort approach, but on factor levels for sake of efficiency. This assumes diploid, biallelic loci.

canonicalize_genotype <- \(x, decreasing.=TRUE) {
  base_levels <- \(A) {
    strsplit(A, '') |> 
      lapply(sort.int, decreasing=decreasing.) |> 
      sapply(paste, collapse='')
  } 
  x <- unlist(x)
  len <- nchar(x)
  if (var(len) != 0) {
    stop('lengths ambiguous.')
  } else {
    len <- el(len)
  }
  a <- vapply(seq_len(len/2) - 1, \(i) {
    substr(x, 1 + 2*i, 2 + 2*i)
  }, FUN.VALUE=character(length(x))) |> 
    as.factor()
  starts <- seq(1L, len, by=2L)
  ends <- starts+1L
  substring(x, starts, ends)
  levels(a) <- base_levels(levels(a))
  matrix(a, ncol=len/2) |>
    as.data.frame() |>
    Reduce(f=paste0)
}

Gives:

> canonicalize_genotype(offspring) |> 
+   table()

aabb aaBb aaBB Aabb AaBb AaBB AAbb AABb AABB 
   1    2    1    2    4    2    1    2    1

Notice that the table displays Xx rather than xX, as the OP seems to prefer. To get xX, use canonicalize_genotype(., decreasing.=FALSE).

I wrote it so it generalizes to more than two loci [[aa, aA, Aa, AA], [bb, bB, Bb, BB], [cc, cC, Cc, CC], ...]:

> offspring3 |> canonicalize_genotype() |> table()

aabbcc aabbCc aabbCC aaBbcc aaBbCc aaBbCC aaBBcc aaBBCc aaBBCC Aabbcc AabbCc AabbCC AaBbcc AaBbCc AaBbCC 
     1      2      1      2      4      2      1      2      1      2      4      2      4      8      4 
AaBBcc AaBBCc AaBBCC AAbbcc AAbbCc AAbbCC AABbcc AABbCc AABbCC AABBcc AABBCc AABBCC 
     2      4      2      1      2      1      2      4      2      1      2      1

Data:

mk_mat <- \(m, n) {
  replicate(m*m, lapply(seq_len(n), \(i) {
    sample(c(letters[i], LETTERS[i]), replace=TRUE) |> 
      paste(collapse='')
  }), simplify=FALSE) |> 
    sapply(paste, collapse='') |> 
    matrix(m, m)
}

set.seed(42)
df <- mk_mat(4, 3) |> 
  as.data.frame()


mk_punnet <- \(nloci) {
  m <- matrix(as.logical(intToBits(0:(2^nloci - 1))), ncol=32, byrow=TRUE)
  loci <- t(replicate(2^nloci, LETTERS[seq_len(nloci)]))
  loci[m[, nloci:1]] <- tolower(loci[m[, nloci:1]])
  aloci <- asplit(loci, 1)
  outer(aloci, aloci, Vectorize(\(a, b) paste0(a, b, collapse=''))) |>
    `dimnames<-`(
      replicate(2, 
                apply(loci, 1, paste, collapse=''), 
                simplify=FALSE)
    ) |>
    as.data.frame()
}

offspring3 <- mk_punnet(nloci=3)

Giving:

> offspring3
       ABC    ABc    AbC    Abc    aBC    aBc    abC    abc
ABC AABBCC AABBCc AABbCC AABbCc AaBBCC AaBBCc AaBbCC AaBbCc
ABc AABBcC AABBcc AABbcC AABbcc AaBBcC AaBBcc AaBbcC AaBbcc
AbC AAbBCC AAbBCc AAbbCC AAbbCc AabBCC AabBCc AabbCC AabbCc
Abc AAbBcC AAbBcc AAbbcC AAbbcc AabBcC AabBcc AabbcC Aabbcc
aBC aABBCC aABBCc aABbCC aABbCc aaBBCC aaBBCc aaBbCC aaBbCc
aBc aABBcC aABBcc aABbcC aABbcc aaBBcC aaBBcc aaBbcC aaBbcc
abC aAbBCC aAbBCc aAbbCC aAbbCc aabBCC aabBCc aabbCC aabbCc
abc aAbBcC aAbBcc aAbbcC aAbbcc aabBcC aabBcc aabbcC aabbcc

ThomasIsCoding · Accepted Answer · 2026-03-04 13:56:52Z

4

As @Rui Barradas mentioned, you could first make the sequence sorted and then apply table.

My solution follows the same idea but use order + intToUtf8 + utf8ToInt to finish the sorting processes. For example, given input data like

x <-c("AB", "Ab", "aB", "ab")
y <-c("AB", "Ab", "aB", "ab")

then you can run

table(outer(x,y,Vectorize(\(p,q) {u <- paste0(p,q); intToUtf8(utf8ToInt(u)[order(c(1,2,1,2),utf8ToInt(u))])})))

or

table(sapply(strsplit(outer(x,y,paste0),""), \(p) paste0(p[order(p,c(1,2,1,2))], collapse = "")))

and you will obtain

aabb aaBb aaBB Aabb AaBb AaBB AAbb AABb AABB 
   1    2    1    2    4    2    1    2    1

edited 15 hours ago

answered yesterday

ThomasIsCoding

106k9 gold badges38 silver badges110 bronze badges

2 Comments

user2554330 yesterday

It would be simpler to use sort(utf8ToInt(u)) in place of utf8ToInt(u)[order(utf8ToInt(u))]. Another possible function to consider is strsplit(); using it could avoid the utf8ToInt() calls, but I don't think it ends up simpler.

ThomasIsCoding yesterday

thank you for your comment! yes, sort should be much simpler, but I guess it cannot give the correct output. I also realized that I made a mistake in my previous solution, so I updated it.

Collectives™ on Stack Overflow

Counting instances of genotype strings where order is irrelevant within locus

4 Answers 4

5 Comments

7 Comments

Comments

2 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

5 Comments

7 Comments

Comments

2 Comments

Linked

Related