I’m still uncertain whether catching confusables only in the output/normalized form is sufficient, but it greatly simplifies the problem. All you need is a mapping from normalized characters to scripts (to determine whether a label is single-script), a set of additional disallowed single characters (which are just extra IDNA rules), and a small list (~400 entries) of character sequences of length 2-4 that are also disallowed anywhere in the output label.
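The three checks above compose into a simple label validator. This is a minimal sketch with toy placeholder data tables (SCRIPTS, DISALLOWED_CHARS, DISALLOWED_SEQS are assumptions for illustration, not the real ~400-entry lists):

```python
# Hypothetical post-normalization confusable check, per the description above.
# All three tables are toy stand-ins for the real data.
SCRIPTS = {ord('a'): 'Latin', ord('\u0431'): 'Cyrillic'}  # normalized char -> script
DISALLOWED_CHARS = {0x0138}                               # extra single-char rules
DISALLOWED_SEQS = ['rn']                                  # disallowed 2-4 char runs

def check_label(label: str) -> bool:
    """Return True if the already-normalized label passes all three checks."""
    # 1) no individually disallowed characters
    if any(ord(ch) in DISALLOWED_CHARS for ch in label):
        return False
    # 2) single-script: every character with a known script shares one script
    scripts = {SCRIPTS.get(ord(ch)) for ch in label} - {None}
    if len(scripts) > 1:
        return False
    # 3) no disallowed sequence appears anywhere in the label
    if any(seq in label for seq in DISALLOWED_SEQS):
        return False
    return True
```

The point is that each check runs on the output form only, so no input-side confusable analysis is needed.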
Here is an ornate example I found where you have confusability on the input (which isn’t covered by UTS #39 because it’s an emoji+text situation) but the output is clearly different (using the proposed normalization, with no additional logic):
Text-styled regional indicators can render as small-caps ASCII:
// default emoji-styled
1F1FA 1F1F8 = "🇺🇸" // 2x regional (valid flag sequence)
1F1FA 1F1F9 = "🇺🇹" // 2x regional (invalid flag sequence)
// explicit emoji styling
1F1FA FE0F 1F1F8 FE0F = "🇺🇸" // same
// explicit text styling
1F1FA FE0E 1F1F8 FE0E = "🇺︎🇸︎" // small-caps ASCII (no flag)
// confusing example
62 1F1F4 FE0E 1F1F8 FE0E 73 = "b🇴︎🇸︎s" vs "boss" // before normalization
= "b🇴🇸s" // after normalization (FE0E is dropped)
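The confusing example can be reproduced directly from the code points. This is a minimal sketch assuming the proposed normalization simply drops FE0E (the real normalization does more than this):

```python
# Regional indicators O and S, and the text-style variation selector.
RI_O, RI_S = '\U0001F1F4', '\U0001F1F8'
FE0E = '\uFE0E'

# Input: may render with small-caps ASCII glyphs, visually close to "boss".
raw = 'b' + RI_O + FE0E + RI_S + FE0E + 's'

# After normalization: FE0E is dropped, leaving bare regional indicators,
# which tend to render as a flag-like pair, clearly distinct from "boss".
normalized = raw.replace(FE0E, '')

print([hex(ord(c)) for c in normalized])
```

So even though the input is confusable with "boss", the normalized output is not, which is why catching confusables on the output form covers this case.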