Domains are designed to be readable and memorable, unlike IP addresses, but there is more to them than that. A domain name consists of multiple parts: a subdomain, a root domain, and a TLD.
Here, I’m going to implement a simple domain parser with Golang using the public suffix list.
The final Golang module is available on my GitHub.
Step #1: Download and parse the list
The public suffix list has two main parts: ICANN domains and private domains. The ICANN part starts with // ===BEGIN ICANN DOMAINS=== and ends with // ===END ICANN DOMAINS===. The same rule applies to the private domains: their section starts with // ===BEGIN PRIVATE DOMAINS=== and ends with // ===END PRIVATE DOMAINS===. We need to consider this when we parse the list and create our tree of TLDs.
I’ve decided to add a mode to each entry of the parsed file and cache it somewhere in the filesystem. mode=1 means the TLD belongs to the ICANN section and mode=2 means it belongs to the private domains section. The final parsed file looks something like this:
.
.
.
travelersinsurance,1
trust,1
trv,1
tube,1
tui,1
.
.
.
ui.nabu.casa,2
pony.club,2
of.fashion,2
.
.
.
These come in handy for creating the TLD tree. I’ve added isPrivate and isIcann to each node.
for _, line := range lines {
	line = strings.TrimSpace(line)
	// Section markers switch the mode for every entry that follows.
	if strings.HasPrefix(line, "// ===BEGIN ICANN DOMAINS===") {
		mode = "1"
	}
	if strings.HasPrefix(line, "// ===BEGIN PRIVATE DOMAINS===") {
		mode = "2"
	}
	// Skip blank lines and comments; write the rest as "suffix,mode".
	if line != "" && !strings.HasPrefix(line, "//") {
		buffer.WriteString(line + "," + mode + "\n")
	}
}
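To illustrate how the cached format could be consumed later, here is a minimal sketch that reads suffix,mode lines back into memory. The suffixRule type and the loadRules function are hypothetical names chosen for this example; they are not taken from the dex module.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// suffixRule is a hypothetical record for one cached line.
type suffixRule struct {
	Suffix    string
	IsIcann   bool
	IsPrivate bool
}

// loadRules parses "suffix,mode" lines, where mode 1 marks an ICANN
// suffix and mode 2 marks a private one. Malformed lines are skipped.
func loadRules(cache string) []suffixRule {
	var rules []suffixRule
	scanner := bufio.NewScanner(strings.NewReader(cache))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		// The mode is whatever follows the last comma.
		i := strings.LastIndex(line, ",")
		if i <= 0 {
			continue
		}
		suffix, mode := line[:i], line[i+1:]
		if mode != "1" && mode != "2" {
			continue
		}
		rules = append(rules, suffixRule{
			Suffix:    suffix,
			IsIcann:   mode == "1",
			IsPrivate: mode == "2",
		})
	}
	return rules
}

func main() {
	cache := "trust,1\ntube,1\npony.club,2\n"
	for _, r := range loadRules(cache) {
		fmt.Printf("%+v\n", r)
	}
}
```

Each record carries the same two flags that the trie nodes use, so building the tree from the cache is a matter of inserting every rule.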
Step #2: Use regex to extract different parts
We need to get rid of the scheme part of the URL. The ^([[:lower:]\d\+\-\.]+:)?// regex will do that for us.
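A small sketch of how this pattern might be applied with Go's regexp package is shown here. The stripScheme helper is a hypothetical name for illustration, not necessarily how the module does it.

```go
package main

import (
	"fmt"
	"regexp"
)

// schemeRe is the scheme pattern quoted in the text.
var schemeRe = regexp.MustCompile(`^([[:lower:]\d\+\-\.]+:)?//`)

// stripScheme removes a leading "scheme://" or "//" if one is present.
func stripScheme(s string) string {
	return schemeRe.ReplaceAllString(s, "")
}

func main() {
	fmt.Println(stripScheme("https://mehrdadep.medium.com"))
	fmt.Println(stripScheme("//example.com/path"))
	fmt.Println(stripScheme("example.com"))
}
```

The pattern only matches lowercase schemes, so this sketch assumes the input has already been lowercased.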
After extracting the TLD, we need to make sure that the root domain is in a valid format. The ^[a-z0-9-\p{L}]{1,63}$ regex checks the validity of the root part of the URL.
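As a quick illustration, the validity check could be wrapped like this. The isValidRoot name is an assumption made for this sketch.

```go
package main

import (
	"fmt"
	"regexp"
)

// rootRe is the root-label pattern quoted in the text: 1 to 63
// characters drawn from a-z, 0-9, the hyphen, and Unicode letters.
var rootRe = regexp.MustCompile(`^[a-z0-9-\p{L}]{1,63}$`)

// isValidRoot reports whether root matches the pattern.
func isValidRoot(root string) bool {
	return rootRe.MatchString(root)
}

func main() {
	for _, r := range []string{"medium", "münchen", "bad_root", ""} {
		fmt.Printf("%q valid: %v\n", r, isValidRoot(r))
	}
}
```

Note that Go's regexp counts code points, not bytes, so the 63 limit here applies to characters in the label.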
Extracting the subdomain part of the URL is easy. We just need to split the subdomain+root part on the dot separator: the last label is the root, and everything before it is the subdomain.
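// extractSubdomain splits "sub.root" into the subdomain and the root.
func extractSubdomain(d string) (string, string) {
	ps := strings.Split(d, ".")
	l := len(ps)
	if l == 1 {
		// No dot at all: there is only a root.
		return "", d
	}
	return strings.Join(ps[0:l-1], "."), ps[l-1]
}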
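If the extracted TLD is empty, we need to check whether the URL is a valid IPv4/IPv6 address. Before using a regex to match the IPv4 format, we can call the built-in net.ParseIP(url) to confirm that the input is an IP address at all. After that, we only need to check for IPv4; anything else that parsed is IPv6, so no IPv6 regex is needed.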
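The sketch below shows one way this check might look. It swaps the IPv4 regex for the standard library's To4 method, which is a substitution made for this example rather than the approach described above, and ipKind is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net"
)

// ipKind reports whether host is an IPv4 or an IPv6 literal.
// Caveat: IPv4-mapped IPv6 addresses such as "::ffff:1.2.3.4"
// have a 4-byte form, so this sketch reports them as IPv4.
func ipKind(host string) (isIPv4, isIPv6 bool) {
	ip := net.ParseIP(host)
	if ip == nil {
		return false, false // not an IP address at all
	}
	if ip.To4() != nil {
		return true, false
	}
	return false, true
}

func main() {
	for _, h := range []string{"192.168.1.1", "2001:db8::1", "example.com"} {
		v4, v6 := ipKind(h)
		fmt.Printf("%s ipv4=%v ipv6=%v\n", h, v4, v6)
	}
}
```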
Step #3: TLD Trie
We use a trie to store the TLDs. Take this part of the list, for instance:
// mz : http://www.uem.mz/
// Submitted by registry <antonio@uem.mz>
mz
ac.mz
adv.mz
co.mz
edu.mz
gov.mz
mil.mz
net.mz
org.mz
This will become a part of the trie.
Each node of the trie includes some information about itself, such as the isPrivate and isIcann flags.
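As a rough illustration of the idea, here is a minimal suffix trie keyed by domain labels, read from right to left. The node type, its fields, and the insert and longestSuffix methods are assumptions made for this sketch, and it ignores the wildcard (*) and exception (!) rules that the real public suffix list contains.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical trie node; there is one node per domain label.
type node struct {
	children  map[string]*node
	isEnd     bool // true if the path from the root forms a listed suffix
	isIcann   bool
	isPrivate bool
}

func newNode() *node { return &node{children: map[string]*node{}} }

// insert adds a suffix such as "co.mz" by walking its labels right to left.
func (n *node) insert(suffix string, icann bool) {
	labels := strings.Split(suffix, ".")
	cur := n
	for i := len(labels) - 1; i >= 0; i-- {
		next, ok := cur.children[labels[i]]
		if !ok {
			next = newNode()
			cur.children[labels[i]] = next
		}
		cur = next
	}
	cur.isEnd = true
	cur.isIcann = icann
	cur.isPrivate = !icann
}

// longestSuffix returns the longest listed suffix of host (tld) and
// whatever precedes it (rest). tld is empty if nothing matches.
func (n *node) longestSuffix(host string) (rest, tld string) {
	labels := strings.Split(host, ".")
	cur := n
	cut := len(labels) // index where the matched suffix starts
	for i := len(labels) - 1; i >= 0; i-- {
		next, ok := cur.children[labels[i]]
		if !ok {
			break
		}
		cur = next
		if cur.isEnd {
			cut = i
		}
	}
	return strings.Join(labels[:cut], "."), strings.Join(labels[cut:], ".")
}

func main() {
	root := newNode()
	for _, s := range []string{"mz", "ac.mz", "co.mz", "gov.mz"} {
		root.insert(s, true)
	}
	rest, tld := root.longestSuffix("www.example.co.mz")
	fmt.Println(rest, tld) // prints "www.example co.mz"
}
```

The rest value is the subdomain+root part, which can then be split into the subdomain and the root.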
Step #4: That’s it, no more steps
Install dex using Go modules.
go get github.com/mehrdadep/dex
And use it just like this:
package main

import (
	"fmt"

	"github.com/mehrdadep/dex"
)

func main() {
	// Path where the parsed public suffix list is cached.
	cache := "/tmp/list.cache"
	extract, _ := dex.New(cache)
	result := extract.Parse("https://mehrdadep.medium.com")
	fmt.Printf("%+v\n", result)
}
This outputs:
&{IsIcann:true IsIpV4:false IsIpV6:false IsPrivate:false Subdomain:mehrdadep Root:medium Tld:com}