[Solved] what could be the best regex for domain validation? [closed]


Disclaimer: The rules that define a “valid domain” are a moving target. The answer below deals only with the “old school” DNS rules (using exclusively ASCII characters) and does not attempt to deal with internationalized domain names (as laid out in RFC3490). Note also that many new top level domains (TLDs) are expected to appear, so the solution below will need to be updated on a regular basis (see IANA.ORG for the current list of valid TLDs).

DNS Named Host Validation

According to the pertinent internet recommendations (RFC3986 section 3.2.2, which in turn refers to RFC1034 section 3.5 and RFC1123 section 2.1), a subdomain (one part of a DNS host name) must meet several requirements:

Subdomain

  • Each subdomain part must have a length no greater than 63.
  • Each subdomain part must begin and end with an alphanumeric character (i.e. a letter [A-Za-z] or a digit [0-9]).
  • Each subdomain part may contain hyphens (dashes), but may not begin or end with a hyphen.

Here is an expression fragment for a subdomain part which meets these requirements:

(?:[A-Za-z0-9][A-Za-z0-9\-]{0,61}[A-Za-z0-9]|[A-Za-z0-9])

Note that this expression requires a group with two alternatives to handle the special case of a subdomain having only one character. Also, this expression fragment should not be used alone – it requires the incorporation of boundary conditions in a larger context, as demonstrated in the following expression for a DNS host name…
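As a quick check of the fragment on its own terms, here is a minimal sketch in Python (not the C# used elsewhere in this answer). The names LABEL, LABEL_RE and is_valid_label are hypothetical, chosen only for illustration. It relies on re.fullmatch to supply the boundary conditions that the bare fragment lacks.

```python
import re

# The subdomain fragment from above, unchanged.
LABEL = r"(?:[A-Za-z0-9][A-Za-z0-9\-]{0,61}[A-Za-z0-9]|[A-Za-z0-9])"
LABEL_RE = re.compile(LABEL)


def is_valid_label(label: str) -> bool:
    """Return True if label is exactly one valid subdomain part.

    fullmatch() anchors the fragment at both ends, so nothing may
    precede or follow the matched text.
    """
    return LABEL_RE.fullmatch(label) is not None


# Try a few single labels, including the 63/64 character boundary.
for candidate in ["a", "example", "my-host", "-bad", "bad-", "a" * 63, "a" * 64]:
    print(len(candidate), candidate[:12], is_valid_label(candidate))
```

Without the anchoring, a search for the fragment would happily find a valid-looking piece inside an invalid string such as "-bad", which is why the fragment should only be used inside a larger anchored expression.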

DNS host name

A named host (not an IP address) must meet additional requirements:

  • The host name may consist of multiple subdomain parts, each separated by a single dot.
  • The length of the overall host name should not exceed 255 characters.
  • The top level domain, (the rightmost part of the DNS host name), must be one of the internationally recognized values. The list of valid top level domains is maintained by IANA.ORG. (See the bare-bones current list here: http://data.iana.org/TLD/tlds-alpha-by-domain.txt).
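Because the TLD list changes over time, one option is to generate the TLD alternation from the IANA file rather than hard-coding it. The following is a minimal Python sketch under stated assumptions: the names SAMPLE_TLD_LIST, parse_tld_list and tld_alternation are hypothetical, and the sample text only imitates the layout of tlds-alpha-by-domain.txt (a # comment line followed by one upper-case TLD per line). It is not the real data, and downloading the file is left out.

```python
import re

# Hypothetical sample in the layout of tlds-alpha-by-domain.txt.
# This is NOT the real IANA data, just enough lines to show the idea.
SAMPLE_TLD_LIST = """\
# Sample only; not the real IANA data
AERO
COM
NET
ORG
UK
XN--P1AI
"""


def parse_tld_list(text: str) -> set[str]:
    """Collect lower-cased TLDs, skipping blank lines and '#' comments."""
    tlds = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            tlds.add(line.lower())
    return tlds


def tld_alternation(tlds: set[str]) -> str:
    """Build a non-capturing regex group that lists every TLD.

    Longer TLDs are placed first so that a shorter one cannot shadow a
    longer one if the group is ever used without an end anchor.
    """
    ordered = sorted(tlds, key=lambda t: (-len(t), t))
    return "(?:" + "|".join(re.escape(t) for t in ordered) + ")"


tlds = parse_tld_list(SAMPLE_TLD_LIST)
print(tld_alternation(tlds))
```

The generated group could stand in for a hand-written TLD alternation in a host name regex, provided the match is done case-insensitively.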

With this in mind, here is a commented regex (in C# syntax) which will pseudo-validate a DNS host name. (Note that it incorporates a modified version of the above expression for a subdomain and adds comments to it as well.)

// Assumes "using System.Text.RegularExpressions;" and a string variable named text.
if (Regex.IsMatch(text, @" # Rev:2013-03-26
    # Match DNS host domain having one or more subdomains.
    # Top level domain subset taken from IANA.ORG. See:
    # http://data.iana.org/TLD/tlds-alpha-by-domain.txt
    ^                  # Anchor to start of string.
    (?!.{256})         # Whole domain must be 255 or less.
    (?:                # Group for one or more sub-domains.
      [a-z0-9]         # Either subdomain length from 2-63.
      [a-z0-9-]{0,61}  # Middle part may have dashes.
      [a-z0-9]         # Starts and ends with alphanum.
      \.               # Dot separates subdomains.
    | [a-z0-9]         # or subdomain length == 1 char.
      \.               # Dot separates subdomains.
    )+                 # One or more sub-domains.
    (?:                # Top level domain alternatives.
      [a-z]{2}         # Either any 2 char country code,
    | AERO|ARPA|ASIA|BIZ|CAT|COM|COOP|EDU|  # or TLD 
      GOV|INFO|INT|JOBS|MIL|MOBI|MUSEUM|    # from list.
      NAME|NET|ORG|POST|PRO|TEL|TRAVEL|XXX  # IANA.ORG
    )                  # End group of TLD alternatives.
    \z                 # Anchor to very end (rejects a trailing newline).",
    RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace))
{
    // Valid named DNS host (domain).
} else {
    // NOT a valid named DNS host.
} 

Note that this expression is not perfect. It requires one or more subdomains, but technically a host can consist of a TLD with no subdomain (though this is rare). It does not explicitly spell out each two-character country code TLD; it simply allows any two letters. It does not list the various TLDs of the XN--XXXXX variety. Nor does it consider the not-yet-fully-implemented-and-universally-acceptable international domain names.
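To make these limitations concrete, here is a rough Python port of the C# pattern (a sketch, not a drop-in replacement). The names HOST_RE and is_valid_host are hypothetical. One assumption differs from a literal port: the end anchor is written as \Z so that a trailing newline is rejected.

```python
import re

# Python port of the C# pattern above, with the same 2013 TLD subset.
HOST_RE = re.compile(r"""
    ^                    # Anchor to start of string.
    (?!.{256})           # Whole domain must be 255 chars or less.
    (?:                  # Group for one or more subdomains.
      [a-z0-9]           # Either subdomain length from 2-63.
      [a-z0-9-]{0,61}    # Middle part may have dashes.
      [a-z0-9]           # Starts and ends with alphanum.
      \.                 # Dot separates subdomains.
    | [a-z0-9]           # or subdomain length == 1 char.
      \.                 # Dot separates subdomains.
    )+                   # One or more subdomains.
    (?:                  # Top level domain alternatives.
      [a-z]{2}           # Either any 2 char country code,
    | AERO|ARPA|ASIA|BIZ|CAT|COM|COOP|EDU|
      GOV|INFO|INT|JOBS|MIL|MOBI|MUSEUM|
      NAME|NET|ORG|POST|PRO|TEL|TRAVEL|XXX
    )                    # End group of TLD alternatives.
    \Z                   # Assumption: anchor to the very end of string.
    """, re.IGNORECASE | re.VERBOSE)


def is_valid_host(host: str) -> bool:
    """Return True if host passes the pseudo-validation pattern."""
    return HOST_RE.match(host) is not None


samples = [
    "example.com",        # ordinary host name
    "www.example.co.uk",  # several subdomains plus a country code
    "com",                # bare TLD, no subdomain
    "example.zz",         # two letters that are not a real country code
    "example.xn--p1ai",   # TLD of the XN--XXXXX variety
    "-bad.com",           # label starts with a hyphen
]
for host in samples:
    print(host, is_valid_host(host))
```

With this pattern, "com" on its own is rejected because at least one subdomain is required, "example.zz" is accepted because any two letters pass as a country code, and "example.xn--p1ai" is rejected because no XN-- TLDs are listed.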

For more on validating other URI components, you may want to take a look at an article I wrote a while back: Regular Expression URI Validation. It provides code snippets in a variety of languages for all of the various URI components as defined by RFC3986.

Happy regexing!
