Implementing a domain parser using Golang

Domains are designed to be readable and memorable, unlike IP addresses, but there is more to them than that. A domain name consists of multiple parts: a subdomain, a root domain, and a TLD (top-level domain).

Here, I’m going to implement a simple domain parser with Golang using the public suffix list.

The final Golang module is available on my GitHub.

Step #1: Download and parse the list

The public suffix list has two main parts: ICANN domains and private domains. The ICANN part starts with // ===BEGIN ICANN DOMAINS=== and ends with // ===END ICANN DOMAINS===. The same rule applies to the private domains: their section starts with // ===BEGIN PRIVATE DOMAINS=== and ends with // ===END PRIVATE DOMAINS===. We need to consider this when we parse the list and create our tree of TLDs.

I’ve decided to add a mode to each entry of the parsed file and cache it somewhere in the filesystem. mode=1 means the TLD belongs to the ICANN section and mode=2 means it belongs to the private domains. The final parsed file looks something like this:

                .
.
.
travelersinsurance,1
trust,1
trv,1
tube,1
tui,1
.
.
.
ui.nabu.casa,2
pony.club,2
of.fashion,2
.
.
.
            

These modes come in handy for creating the TLD tree, where I’ve added isPrivate and isIcann to each node. The parsed file is produced by a loop like this:

                // lines, mode, and buffer are assumed to be declared before this loop.
for _, line := range lines {
   line = strings.TrimSpace(line)
   // Section markers switch the mode for all rules that follow.
   if strings.HasPrefix(line, "// ===BEGIN ICANN DOMAINS===") {
      mode = "1"
   }
   if strings.HasPrefix(line, "// ===BEGIN PRIVATE DOMAINS===") {
      mode = "2"
   }
   // Skip blank lines and comments; write every rule as "rule,mode".
   if line != "" && !strings.HasPrefix(line, "//") {
      buffer.WriteString(line + "," + mode + "\n")
   }
}
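
Reading the cache back is the reverse of the loop above. The following is only a minimal sketch, not the code from the module: parseCacheLine and the rule type are hypothetical names, and it assumes every cached line has the rule,mode shape shown earlier.

```go
package main

import (
	"fmt"
	"strings"
)

// rule is a hypothetical in-memory form of one cached line.
type rule struct {
	Suffix    string
	IsIcann   bool
	IsPrivate bool
}

// parseCacheLine splits a "suffix,mode" line on its last comma.
// It reports false for lines that do not match that shape.
func parseCacheLine(line string) (rule, bool) {
	line = strings.TrimSpace(line)
	i := strings.LastIndex(line, ",")
	if i <= 0 {
		return rule{}, false
	}
	suffix, mode := line[:i], line[i+1:]
	switch mode {
	case "1":
		return rule{Suffix: suffix, IsIcann: true}, true
	case "2":
		return rule{Suffix: suffix, IsPrivate: true}, true
	}
	return rule{}, false
}

func main() {
	for _, l := range []string{"trust,1", "ui.nabu.casa,2"} {
		r, ok := parseCacheLine(l)
		fmt.Printf("%+v %v\n", r, ok)
	}
}
```

Splitting on the last comma keeps the sketch independent of what the rule itself contains.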
            

Step #2: Use regex to extract different parts

We need to get rid of the scheme part of the URL. The ^([[:lower:]\d\+\-\.]+:)?// regex will do that for us.
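
As a rough illustration, the regex can be applied with Go's regexp package as below. stripScheme is a hypothetical helper name, and lowercasing the input first is my assumption (the pattern only matches lowercase scheme characters), not something the module is confirmed to do.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// schemeRe is the regex from the text: an optional scheme followed by "//".
var schemeRe = regexp.MustCompile(`^([[:lower:]\d\+\-\.]+:)?//`)

// stripScheme lowercases the input (an assumption of this sketch)
// and removes a leading "scheme://" or "//" if one is present.
func stripScheme(u string) string {
	return schemeRe.ReplaceAllString(strings.ToLower(u), "")
}

func main() {
	fmt.Println(stripScheme("https://mehrdadep.medium.com"))
	fmt.Println(stripScheme("//example.com/path"))
	fmt.Println(stripScheme("example.com"))
}
```

Inputs without a scheme pass through unchanged, because the pattern is anchored at the start and requires the // separator.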

After extracting the TLD, we need to make sure that the root domain is in a valid format. The ^[a-z0-9-\p{L}]{1,63}$ regex checks the validity of the root part of the URL.

Extracting the subdomain part of the URL is easy. We just need to split the subdomain+root part on the dot separator: the last label is the root and everything before it is the subdomain.
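
A small sketch of that check follows, assuming the range in the pattern is meant to be 0-9 with a plain ASCII hyphen. validRoot is a hypothetical name used only for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// rootRe allows 1 to 63 characters: a-z, digits, hyphen, or any Unicode letter.
var rootRe = regexp.MustCompile(`^[a-z0-9-\p{L}]{1,63}$`)

// validRoot reports whether root matches the pattern above.
func validRoot(root string) bool {
	return rootRe.MatchString(root)
}

func main() {
	for _, r := range []string{"medium", "my-site", "münchen", "bad_name", ""} {
		fmt.Printf("%q valid=%v\n", r, validRoot(r))
	}
}
```

Note that the pattern only constrains the character set and the length; it does not, for example, reject a label that starts or ends with a hyphen.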

                func extractSubdomain(d string) (string, string) {
   ps := strings.Split(d, ".")
   l := len(ps)
   if l == 1 {
      return "", d
   }
   return strings.Join(ps[0:l-1], "."), ps[l-1]
}
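
To make the behaviour concrete, here is the same function wrapped in a runnable program. The inputs are made up for illustration and assume the TLD has already been removed from the host.

```go
package main

import (
	"fmt"
	"strings"
)

// extractSubdomain splits "subdomain.root" into its two parts.
// The last label is the root; everything before it is the subdomain.
func extractSubdomain(d string) (string, string) {
	ps := strings.Split(d, ".")
	l := len(ps)
	if l == 1 {
		return "", d
	}
	return strings.Join(ps[0:l-1], "."), ps[l-1]
}

func main() {
	for _, d := range []string{"medium", "mehrdadep.medium", "www.blog.example"} {
		sub, root := extractSubdomain(d)
		fmt.Printf("input=%q subdomain=%q root=%q\n", d, sub, root)
	}
}
```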
            

If the extracted TLD is empty, we need to check whether the URL is a valid IPv4 or IPv6 address. Before using a regex to match the IPv4 format, we can use the built-in net.ParseIP(url) and then only check for IPv4 (skipping the regex for IPv6).
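
One way to sketch this with the standard library alone is shown below. classifyIP is a hypothetical helper, and using ip.To4() to tell the two families apart is my choice for this example; the module may do it differently.

```go
package main

import (
	"fmt"
	"net"
)

// classifyIP reports whether host is a literal IPv4 or IPv6 address.
// Both results are false when host is not an IP address at all.
func classifyIP(host string) (isIPv4, isIPv6 bool) {
	ip := net.ParseIP(host)
	if ip == nil {
		return false, false
	}
	// To4 returns nil when the address has no 4-byte representation.
	if ip.To4() != nil {
		return true, false
	}
	return false, true
}

func main() {
	for _, h := range []string{"192.168.1.1", "2001:db8::1", "example.com"} {
		v4, v6 := classifyIP(h)
		fmt.Printf("%s ipv4=%v ipv6=%v\n", h, v4, v6)
	}
}
```

net.ParseIP expects a bare address, so any port or surrounding brackets would have to be removed before calling it.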

Step #3: TLD Trie

We use a trie to store the TLDs. Take this part of the list, for instance:

                // mz : http://www.uem.mz/
// Submitted by registry <antonio@uem.mz>
mz
ac.mz
adv.mz
co.mz
edu.mz
gov.mz
mil.mz
net.mz
org.mz
            

This becomes a part of the trie, with mz as a node and ac, adv, co, edu, gov, mil, net, and org as its children.

Each node of the trie stores some information about itself, such as:

  • Whether it is a private or an ICANN domain
  • Whether it is a valid TLD
  • Whether it is an exception rule (rules starting with !)
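
The structure described above could be sketched roughly as follows. This is not the module's actual implementation: the node type, its field names, and the insert and longestSuffix methods are hypothetical, and wildcard rules and the special treatment of exception rules during lookup are left out to keep the example short.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical trie node keyed by domain label.
type node struct {
	children    map[string]*node
	isValid     bool // a rule from the list ends at this node
	isIcann     bool
	isPrivate   bool
	isException bool // the rule started with "!"
}

func newNode() *node { return &node{children: map[string]*node{}} }

// insert adds one rule, walking its labels from right to left,
// so "co.mz" becomes root -> "mz" -> "co".
func (n *node) insert(rule, mode string) {
	exception := strings.HasPrefix(rule, "!")
	labels := strings.Split(strings.TrimPrefix(rule, "!"), ".")
	cur := n
	for i := len(labels) - 1; i >= 0; i-- {
		next, ok := cur.children[labels[i]]
		if !ok {
			next = newNode()
			cur.children[labels[i]] = next
		}
		cur = next
	}
	cur.isValid = true
	cur.isIcann = mode == "1"
	cur.isPrivate = mode == "2"
	cur.isException = exception
}

// longestSuffix returns the longest stored rule that ends host,
// plus whatever is left in front of it.
func (n *node) longestSuffix(host string) (rest, tld string) {
	labels := strings.Split(host, ".")
	cur := n
	cut := len(labels) // index where the matched suffix starts
	for i := len(labels) - 1; i >= 0; i-- {
		next, ok := cur.children[labels[i]]
		if !ok {
			break
		}
		cur = next
		if cur.isValid {
			cut = i
		}
	}
	return strings.Join(labels[:cut], "."), strings.Join(labels[cut:], ".")
}

func main() {
	root := newNode()
	root.insert("mz", "1")
	root.insert("co.mz", "1")
	root.insert("ui.nabu.casa", "2")

	rest, tld := root.longestSuffix("shop.example.co.mz")
	fmt.Printf("rest=%q tld=%q\n", rest, tld)
}
```

Storing labels from right to left means a lookup can stop as soon as a label is missing, and the deepest valid node seen so far gives the longest matching suffix.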

Step #4: That’s it, no more steps

Install dex using Go modules.

                go get github.com/mehrdadep/dex
            

And use it just like this:

                package main

import (
    "fmt"

    "github.com/mehrdadep/dex"
)

func main() {
    cache := "/tmp/list.cache"
    extract, _ := dex.New(cache)
    result := extract.Parse("https://mehrdadep.medium.com")
    fmt.Printf("%+v\n", result)
}
            

This will output:

                &{IsIcann:true IsIpV4:false IsIpV6:false IsPrivate:false Subdomain:mehrdadep Root:medium Tld:com}
            


Mehrdad Esmaeilpour

@mehrdadep
Software engineer, writer and something of a dreamer.