URL Parsing

This guide covers how pywhatwgurl parses URLs according to the WHATWG URL Standard.

The URL Constructor

The URL class constructor accepts one or two arguments:

URL(url: str, base: str | URL | None = None)
  • url: The URL string to parse
  • base: Optional base URL for resolving relative URLs

Spec Reference

See the URL class constructor in the WHATWG URL Standard.

Absolute URLs

An absolute URL contains all the information needed to locate a resource:

from pywhatwgurl import URL

# Complete absolute URL
url = URL("https://example.com:8080/path?query#hash")

# Minimal absolute URL (just scheme and host)
url = URL("https://example.com")

Relative URLs

Relative URLs are resolved against a base URL using the URL parsing algorithm:

from pywhatwgurl import URL

base = URL("https://example.com/docs/guide/intro.html")

# Path-relative
URL("./images/logo.png", base)  # https://example.com/docs/guide/images/logo.png

# Parent directory
URL("../api/", base)  # https://example.com/docs/api/

# Root-relative
URL("/absolute/path", base)  # https://example.com/absolute/path

# Protocol-relative
URL("//other.com/path", base)  # https://other.com/path

# Query-only
URL("?newquery", base)  # https://example.com/docs/guide/intro.html?newquery

# Fragment-only  
URL("#section", base)  # https://example.com/docs/guide/intro.html#section

URL Normalization

pywhatwgurl normalizes URLs during parsing according to the URL serialization rules:

Scheme Normalization

Schemes are lowercased per the scheme state:

from pywhatwgurl import URL
URL("HTTPS://Example.Com").href  # "https://example.com/"

Host Normalization

Hostnames are lowercased and IDNA-encoded per host parsing:

from pywhatwgurl import URL

URL("https://EXAMPLE.COM").hostname  # "example.com"
URL("https://例え.jp").hostname       # "xn--r8jz45g.jp"

IDNA Processing

Domain names are converted to ASCII using UTS #46 processing (the IDNA 2008-compatible, non-transitional mode) with CheckHyphens=false and VerifyDnsLength=false, as specified in the domain to ASCII algorithm.
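To see what the ASCII (Punycode) form of a label looks like, Python's built-in codecs are enough. This is only an illustrative sketch: the standard library's idna codec implements the older IDNA 2003 rules, not the algorithm pywhatwgurl follows, so agreement is assumed only for simple labels such as the one in the host example above.

```python
# Illustration only: stdlib codecs, not pywhatwgurl's own IDNA processing.
label = "例え"

# Punycode encodes the non-ASCII label; the "xn--" prefix marks it as encoded.
ace_label = "xn--" + label.encode("punycode").decode("ascii")
print(ace_label)  # xn--r8jz45g

# The stdlib "idna" codec (IDNA 2003) applies this to each label of a hostname.
ascii_host = "例え.jp".encode("idna").decode("ascii")
print(ascii_host)  # xn--r8jz45g.jp

# Decoding reverses the transformation and recovers the Unicode hostname.
round_trip = ascii_host.encode("ascii").decode("idna")
```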

Path Normalization

Paths are percent-encoded and dot segments are resolved per the path state:

from pywhatwgurl import URL

URL("https://example.com/a/../b/./c").pathname   # "/b/c"
URL("https://example.com/hello world").pathname  # "/hello%20world"
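The two path rules can be sketched in a few lines of standard-library Python. The helper normalize_path below is hypothetical and heavily simplified: it is not pywhatwgurl's implementation, its set of "safe" characters only approximates the spec's path percent-encode set, and it ignores backslashes and percent-encoded dot segments such as %2e.

```python
from urllib.parse import quote

def normalize_path(path: str) -> str:
    """Resolve "." and ".." segments and percent-encode the rest (simplified)."""
    output: list[str] = []
    segments = path.split("/")[1:]  # drop the empty piece before the leading "/"
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == "..":
            if output:
                output.pop()       # step up one level, never above the root
            if is_last:
                output.append("")  # a trailing ".." leaves a trailing slash
        elif segment == ".":
            if is_last:
                output.append("")  # a trailing "." leaves a trailing slash
        else:
            # "%" is kept safe so existing escapes are not double-encoded.
            output.append(quote(segment, safe="!$&'()*+,;=:@%[]|~"))
    return "/" + "/".join(output)

print(normalize_path("/a/../b/./c"))   # /b/c
print(normalize_path("/hello world"))  # /hello%20world
```

Keeping an output stack and popping on ".." mirrors how the parser shortens the path as it consumes segments, which is why ".." can never climb above the root.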

Default Ports

The default port of a special scheme is omitted per URL serialization:

from pywhatwgurl import URL

URL("https://example.com:443").port   # "" (empty, default port)
URL("https://example.com:8080").port  # "8080"
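The rule amounts to a table lookup. The sketch below uses a hypothetical helper, effective_port, to show the idea. It is an illustration of the rule under the assumption that the port getter returns a string, not the library's code.

```python
from typing import Optional

# Default ports of the special schemes; "file" has no default port.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21, "ws": 80, "wss": 443, "file": None}

def effective_port(scheme: str, port: Optional[int]) -> str:
    """Return the port as a string, or "" when it is absent or the scheme's default."""
    if port is None or DEFAULT_PORTS.get(scheme.lower()) == port:
        return ""
    return str(port)

print(repr(effective_port("https", 443)))   # ''
print(repr(effective_port("https", 8080)))  # '8080'
print(repr(effective_port("custom", 443)))  # '443' (no default for non-special schemes)
```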

Special Schemes

The WHATWG URL Standard defines special schemes with specific parsing rules:

| Scheme | Default Port | Notes |
|--------|--------------|-------|
| http | 80 | Standard HTTP |
| https | 443 | Secure HTTP |
| ftp | 21 | File Transfer Protocol |
| ws | 80 | WebSocket |
| wss | 443 | Secure WebSocket |
| file | (none) | Local files |

Special schemes have additional behaviors:

from pywhatwgurl import URL

# Special schemes get a default "/" path
URL("https://example.com").pathname  # "/"

# Non-special schemes preserve empty path
URL("custom://example.com").pathname  # ""

# Special schemes allow backslash as path separator
URL("https://example.com\\path").pathname  # "/path"
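The path behaviors shown above can be summarized in one small function. The name initial_path is hypothetical and the logic is a deliberately narrow sketch: it models only the default "/" path and the backslash handling, not the rest of the parser.

```python
SPECIAL_SCHEMES = {"http", "https", "ftp", "ws", "wss", "file"}

def initial_path(scheme: str, raw_path: str) -> str:
    """Apply the special-scheme path rules to the raw text after the host."""
    if scheme.lower() in SPECIAL_SCHEMES:
        # Special schemes treat "\" like "/" and never have an empty path.
        return raw_path.replace("\\", "/") or "/"
    # Non-special schemes keep the path exactly as written, even if empty.
    return raw_path

print(repr(initial_path("https", "")))        # '/'
print(repr(initial_path("custom", "")))       # ''
print(repr(initial_path("https", "\\path")))  # '/path'
```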

Parsing Errors

When a URL cannot be parsed, the URL constructor raises a ValueError. The spec defines validation errors for the various failure conditions:

from pywhatwgurl import URL

# Missing scheme (and no base URL to resolve against)
try:
    URL("://no-scheme")
except ValueError:
    print("Invalid URL")

# Use URL.parse() for graceful error handling
url = URL.parse("invalid url")  # Returns None instead of raising
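Conceptually, URL.parse() behaves like the constructor wrapped in a try/except. The helper below is a hypothetical sketch of that pattern, not the library's implementation. It accepts any callable that raises ValueError, and the example uses ipaddress.ip_address as a stand-in so that it runs without pywhatwgurl installed.

```python
import ipaddress
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def parse_or_none(constructor: Callable[[str], T], raw: str) -> Optional[T]:
    """Call constructor(raw) and turn a ValueError into None."""
    try:
        return constructor(raw)
    except ValueError:
        return None

# Stand-in parser: ip_address() raises ValueError on bad input, like URL() does.
candidates = ["192.0.2.1", "not an address", "2001:db8::1"]
parsed = [parse_or_none(ipaddress.ip_address, text) for text in candidates]
valid = [value for value in parsed if value is not None]
print(len(valid))  # 2
```

With pywhatwgurl installed, the same call shape would be parse_or_none(URL, "invalid url"), although URL.parse() already provides this directly.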

Error Messages

Error messages describe why parsing failed, helping debug malformed URLs.

Comparison with urllib.parse

pywhatwgurl follows the WHATWG standard while urllib.parse follows RFC 3986. Key differences:

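| Scenario | pywhatwgurl | urllib.parse |
|----------|-------------|--------------|
| http://example.com\\path | Path is /path | Backslash stays in netloc; path is empty |
| http://example.com:80/path | Port omitted | Port preserved |
| Hostname case | Always lowercase | Preserved in netloc (the hostname attribute is lowercased) |
| Unicode hosts | IDNA encoded | Not IDNA-encoded; may raise an error |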
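The urllib.parse side of this comparison can be checked with the standard library alone. The snippet below is a small sketch using urlsplit. It shows only what urllib.parse does, so the pywhatwgurl results should be taken from the examples earlier in this guide.

```python
from urllib.parse import urlsplit

# A backslash is not a path separator here: it stays inside netloc.
backslash = urlsplit("http://example.com\\path")
print(repr(backslash.netloc), repr(backslash.path))  # 'example.com\\path' ''

# An explicit default port is kept as written.
with_port = urlsplit("http://example.com:80/path")
print(with_port.port, with_port.geturl())  # 80 http://example.com:80/path

# The scheme is lowercased; netloc keeps its case, .hostname is lowercased.
upper = urlsplit("HTTP://EXAMPLE.COM/")
print(upper.scheme, upper.netloc, upper.hostname)  # http EXAMPLE.COM example.com

# A Unicode host is passed through without IDNA encoding.
unicode_host = urlsplit("https://例え.jp/")
is_ascii = unicode_host.hostname.isascii()
print(is_ascii)  # False
```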

Why WHATWG?

The WHATWG URL Standard defines precise parsing rules for web browsers, ensuring consistent behavior across the web. RFC 3986 is a generic URI standard used in broader contexts beyond web browsing.
