URL Parsing¶
This guide covers how pywhatwgurl parses URLs according to the WHATWG URL Standard.
The URL Constructor¶
The URL class constructor accepts one or two arguments:
- url: The URL string to parse
- base: Optional base URL for resolving relative URLs
Spec Reference
See URL class constructor in the WHATWG URL Standard.
Absolute URLs¶
An absolute URL contains all the information needed to locate a resource:
from pywhatwgurl import URL
# Complete absolute URL
url = URL("https://example.com:8080/path?query#hash")
# Minimal absolute URL (just scheme and host)
url = URL("https://example.com")
Relative URLs¶
Relative URLs are resolved against a base URL using the URL parsing algorithm:
from pywhatwgurl import URL
base = URL("https://example.com/docs/guide/intro.html")
# Path-relative
URL("./images/logo.png", base) # https://example.com/docs/guide/images/logo.png
# Parent directory
URL("../api/", base) # https://example.com/docs/api/
# Root-relative
URL("/absolute/path", base) # https://example.com/absolute/path
# Protocol-relative
URL("//other.com/path", base) # https://other.com/path
# Query-only
URL("?newquery", base) # https://example.com/docs/guide/intro.html?newquery
# Fragment-only
URL("#section", base) # https://example.com/docs/guide/intro.html#section
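As a rough cross-check, the standard library's `urllib.parse.urljoin` resolves these particular references to the same strings. The sketch below only compares results; it is not how pywhatwgurl works internally, and the helper name `resolve_all` is made up for illustration. The two libraries can diverge on other inputs, such as backslashes or non-ASCII hosts.

```python
from urllib.parse import urljoin

BASE = "https://example.com/docs/guide/intro.html"

# Reference -> expected result, copied from the relative URL examples above.
CASES = {
    "./images/logo.png": "https://example.com/docs/guide/images/logo.png",
    "../api/": "https://example.com/docs/api/",
    "/absolute/path": "https://example.com/absolute/path",
    "//other.com/path": "https://other.com/path",
    "?newquery": "https://example.com/docs/guide/intro.html?newquery",
    "#section": "https://example.com/docs/guide/intro.html#section",
}


def resolve_all(base, references):
    """Resolve each reference against base with the stdlib resolver."""
    return {ref: urljoin(base, ref) for ref in references}


resolved = resolve_all(BASE, CASES)
for ref in CASES:
    print(f"{ref!r:22} -> {resolved[ref]}")
```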
URL Normalization¶
pywhatwgurl normalizes URLs during parsing, so the serialized result is already in canonical form. The subsections below describe what happens to each component.
Scheme Normalization¶
Schemes are lowercased per the scheme state:
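The idea can be illustrated with a small pure-Python sketch. It is a simplification written for this guide, not pywhatwgurl's code: the function name `parse_scheme` is hypothetical, and it skips steps the real parser performs first, such as stripping leading and trailing whitespace.

```python
import string
from typing import Optional

# Characters allowed in a scheme after the first letter.
SCHEME_CHARS = set(string.ascii_letters + string.digits + "+-.")


def parse_scheme(raw: str) -> Optional[str]:
    """Return the lowercased scheme of raw, or None if it has no valid scheme."""
    head, sep, _ = raw.partition(":")
    if not sep or not head:
        return None  # no colon, or nothing before it
    if head[0] not in string.ascii_letters:
        return None  # a scheme must start with an ASCII letter
    if any(ch not in SCHEME_CHARS for ch in head):
        return None
    return head.lower()


print(parse_scheme("HTTPS://EXAMPLE.COM"))  # https
print(parse_scheme("://no-scheme"))         # None
```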
Host Normalization¶
Hostnames are lowercased and IDNA-encoded per host parsing:
URL("https://EXAMPLE.COM").hostname # "example.com"
URL("https://例え.jp").hostname # "xn--r8jz45g.jp"
IDNA Processing
Domain names are processed using IDNA 2008 with CheckHyphens=false and VerifyDnsLength=false as specified in the domain to ASCII algorithm.
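To see where the `xn--` form comes from, the sketch below encodes each non-ASCII label with Python's built-in `punycode` codec and adds the ACE prefix. This is an illustration only: the helper names are hypothetical, and the code skips the mapping and validation steps that the domain to ASCII algorithm performs, so it should not be treated as a substitute for pywhatwgurl's host parsing.

```python
ACE_PREFIX = "xn--"


def to_ace_label(label: str) -> str:
    """Encode one domain label; ASCII labels are only lowercased."""
    lowered = label.lower()
    if lowered.isascii():
        return lowered
    # The punycode codec implements only the RFC 3492 encoding step.
    return ACE_PREFIX + lowered.encode("punycode").decode("ascii")


def to_ace_domain(domain: str) -> str:
    """Encode every label of a dot-separated domain name."""
    return ".".join(to_ace_label(label) for label in domain.split("."))


print(to_ace_domain("例え.jp"))      # xn--r8jz45g.jp
print(to_ace_domain("EXAMPLE.COM"))  # example.com
```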
Path Normalization¶
Paths are percent-encoded and dot segments are resolved per the path state:
URL("https://example.com/a/../b/./c").pathname # "/b/c"
URL("https://example.com/hello world").pathname # "/hello%20world"
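Dot-segment resolution can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation: the function name `resolve_dot_segments` is hypothetical, it assumes a path that starts with `/`, and it ignores percent-encoding and percent-encoded dot segments such as `%2e`.

```python
def resolve_dot_segments(path: str) -> str:
    """Collapse "." and ".." segments in an absolute path."""
    segments = path.split("/")[1:]  # drop the empty string before the leading "/"
    output = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == "..":
            if output:
                output.pop()       # step up one level, never above the root
            if is_last:
                output.append("")  # keep a trailing slash
        elif segment == ".":
            if is_last:
                output.append("")
        else:
            output.append(segment)
    return "/" + "/".join(output)


print(resolve_dot_segments("/a/../b/./c"))  # /b/c
print(resolve_dot_segments("/a/b/.."))      # /a/
```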
Default Ports¶
A port that equals the default port of a special scheme is omitted per URL serialization:
URL("https://example.com:443").port # "" (empty, default port)
URL("https://example.com:8080").port # "8080"
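The rule amounts to a table lookup. The sketch below assumes the default ports the URL Standard assigns to special schemes; the names `DEFAULT_PORTS` and `serialized_port` are invented for this example and are not part of pywhatwgurl's API.

```python
from typing import Optional

# Default ports of the special schemes; "file" has no default port.
DEFAULT_PORTS = {
    "http": 80,
    "https": 443,
    "ftp": 21,
    "ws": 80,
    "wss": 443,
    "file": None,
}


def serialized_port(scheme: str, port: Optional[int]) -> str:
    """Return the port as it would be serialized, or "" when it is omitted."""
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        return ""
    return str(port)


print(repr(serialized_port("https", 443)))   # ''
print(repr(serialized_port("https", 8080)))  # '8080'
```

A scheme that is not in the table, such as `custom`, has no default, so any explicit port is kept.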
Special Schemes¶
The WHATWG URL Standard defines special schemes with specific parsing rules:
| Scheme | Default Port | Notes |
|---|---|---|
| http | 80 | Standard HTTP |
| https | 443 | Secure HTTP |
| ftp | 21 | File Transfer Protocol |
| ws | 80 | WebSocket |
| wss | 443 | Secure WebSocket |
| file | none | Local files |
Special schemes have additional behaviors:
from pywhatwgurl import URL
# Special schemes get a default "/" path
URL("https://example.com").pathname # "/"
# Non-special schemes preserve empty path
URL("custom://example.com").pathname # ""
# Special schemes allow backslash as path separator
URL("https://example.com\\path").pathname # "/path"
Parsing Errors¶
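When parsing fails, the URL constructor raises a ValueError. The spec names the validation errors behind these failures. Not every validation error is fatal: some inputs, such as a backslash in a special URL, are repaired and parsing continues. A URL that cannot be parsed at all looks like this: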
from pywhatwgurl import URL
# Missing scheme and no base URL to resolve against
try:
    URL("://no-scheme")
except ValueError:
    print("Invalid URL")
# Use URL.parse() for graceful error handling
url = URL.parse("invalid url") # Returns None instead of raising
Error Messages
The ValueError message describes why parsing failed, which helps when debugging malformed URLs.
Comparison with urllib.parse¶
pywhatwgurl follows the WHATWG standard while urllib.parse follows RFC 3986. Key differences:
| Scenario | pywhatwgurl | urllib.parse |
|---|---|---|
| http://example.com\\path | /path | \\path |
| http://example.com:80/path | Port omitted | Port preserved |
| Hostname case | Always lowercase | Preserved |
| Unicode hosts | IDNA encoded | May raise error |
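The `urllib.parse` side of the port and hostname rows can be checked with the standard library alone. The sketch below is a small illustration with made-up variable names; it shows that `urlsplit` keeps the authority as written in `netloc`, while its `hostname` attribute is a lowercased convenience view.

```python
from urllib.parse import urlsplit

# An explicit default port survives splitting and reassembly.
with_port = urlsplit("http://example.com:80/path")
print(with_port.netloc)    # example.com:80
print(with_port.geturl())  # http://example.com:80/path

# The authority keeps its original case in netloc.
mixed_case = urlsplit("http://EXAMPLE.COM/path")
print(mixed_case.netloc)    # EXAMPLE.COM
print(mixed_case.hostname)  # example.com
```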
Why WHATWG?
The WHATWG URL Standard defines precise parsing rules for web browsers, ensuring consistent behavior across the web. RFC 3986 is a generic URI standard used in broader contexts beyond web browsing.
Further Reading¶
- WHATWG URL Standard — The complete specification
- URL parsing algorithm — How URLs are parsed
- URL serialization — How URLs are converted to strings
Next Steps¶
- Learn about URL Components
- Explore URLSearchParams