Migrating from urllib.parse
This guide helps you migrate from Python's urllib.parse module to pywhatwgurl. It covers equivalent APIs, behavioral differences, and common patterns.
Different Standards
urllib.parse follows RFC 3986 while pywhatwgurl implements the WHATWG URL Standard. These standards intentionally differ in how they parse URLs. This guide highlights where behavior diverges so you can migrate with confidence.
Quick Reference
| urllib.parse | pywhatwgurl | Notes |
|---|---|---|
| urlparse(url) | URL(url) | Returns an object with named properties instead of a tuple |
| urlsplit(url) | URL(url) | Same as above (params is not a WHATWG concept) |
| urljoin(base, url) | URL(url, base).href | Argument order is reversed |
| parse_qs(qs) | URLSearchParams(qs) | Returns a mapping-like object, not dict[str, list] |
| parse_qsl(qs) | list(URLSearchParams(qs).items()) | Returns list[tuple[str, str]] |
| quote(s) | Automatic per-component encoding | See Percent-Encoding |
| unquote(s) | Automatic on parsed URL properties | Decoded values accessible via properties |
| quote_plus(s) | str(URLSearchParams({"k": s})) | Spaces → + in query strings |
| urlunparse(parts) | Build with URL + property setters | See Building URLs |
| urlencode(query) | str(URLSearchParams(query)) | Accepts dict or list of tuples |
Parsing URLs
from urllib.parse import urlparse
result = urlparse("https://user:pass@example.com:8080/path?q=1#frag")
print(result.scheme) # "https"
print(result.netloc) # "user:pass@example.com:8080"
print(result.hostname) # "example.com"
print(result.port) # 8080 (int)
print(result.path) # "/path"
print(result.query) # "q=1"
print(result.fragment) # "frag"
from pywhatwgurl import URL
url = URL("https://user:pass@example.com:8080/path?q=1#frag")
print(url.protocol) # "https:" (includes colon)
print(url.host) # "example.com:8080"
print(url.hostname) # "example.com"
print(url.port) # "8080" (string, not int)
print(url.pathname) # "/path"
print(url.search) # "?q=1" (includes ?)
print(url.hash) # "#frag" (includes #)
print(url.username) # "user"
print(url.password) # "pass"
Key differences:
- protocol includes the trailing colon ("https:" vs "https")
- search includes the leading ? ("?q=1" vs "q=1")
- hash includes the leading # ("#frag" vs "frag")
- port is a string, not an integer
- There is no netloc; use host, hostname, username, password separately
- There is no params; semicolons in paths are not treated specially
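When existing code expects urllib.parse-style values, the differences above can be bridged with a small adapter. The sketch below is illustrative only: to_urllib_style is a hypothetical helper, not part of pywhatwgurl, and it relies only on the property names shown in this guide. A SimpleNamespace stands in for a parsed URL so the snippet runs without the library installed.

```python
from types import SimpleNamespace


def to_urllib_style(url):
    """Map WHATWG-style properties to urllib.parse-style values (hypothetical helper)."""
    return {
        "scheme": url.protocol[:-1] if url.protocol.endswith(":") else url.protocol,
        "hostname": url.hostname,
        "port": int(url.port) if url.port else None,  # "" means no explicit port
        "path": url.pathname,
        "query": url.search[1:] if url.search.startswith("?") else url.search,
        "fragment": url.hash[1:] if url.hash.startswith("#") else url.hash,
    }


# Stand-in carrying the property values from the pywhatwgurl example above.
whatwg_like = SimpleNamespace(
    protocol="https:", hostname="example.com", port="8080",
    pathname="/path", search="?q=1", hash="#frag",
)
converted = to_urllib_style(whatwg_like)
print(converted["scheme"], converted["port"], converted["query"], converted["fragment"])
```

Passing a real URL object instead of the stand-in should work the same way, provided it exposes the same properties.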
Joining URLs
Argument Order
urljoin(base, url) takes base first, but URL(url, base) takes the URL first and base second. This matches the WHATWG URL constructor.
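A minimal sketch of the argument-order difference. The urllib.parse half uses only the standard library; the pywhatwgurl half is skipped when the package is not installed and uses the URL(url, base).href form from the Quick Reference table. The example URLs are made up for illustration.

```python
from urllib.parse import urljoin

base = "https://example.com/docs/guide/"
relative = "../api/index.html"

# urllib.parse: base first, reference second.
joined = urljoin(base, relative)
print(joined)  # https://example.com/docs/api/index.html

try:
    from pywhatwgurl import URL  # third-party; optional for this sketch
except ImportError:
    URL = None

if URL is not None:
    # pywhatwgurl: reference first, base second.
    print(URL(relative, base).href)
```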
Query String Parsing
from urllib.parse import parse_qs, parse_qsl
# parse_qs returns dict with list values
result = parse_qs("tag=a&tag=b&lang=py")
print(result) # {"tag": ["a", "b"], "lang": ["py"]}
# parse_qsl returns list of tuples
result = parse_qsl("tag=a&tag=b&lang=py")
print(result) # [("tag", "a"), ("tag", "b"), ("lang", "py")]
from pywhatwgurl import URLSearchParams
params = URLSearchParams("tag=a&tag=b&lang=py")
# Dictionary-style access (first value)
print(params["lang"]) # "py"
# All values for a key
print(params.get_all("tag")) # ("a", "b")
# As list of tuples (like parse_qsl)
print(list(params.items())) # [("tag", "a"), ("tag", "b"), ("lang", "py")]
Separator Differences
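Python versions earlier than 3.10 treated both & and ; as separators in parse_qs by default (the change was also backported to security releases of older versions). Current versions split only on & unless you pass a separator argument. URLSearchParams only uses & per the WHATWG spec. If your data uses ; as a separator, you'll need to replace it before parsing.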
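One way to handle semicolon-separated data is to normalize the separator first. This is a sketch under a stated assumption: every ; in the input is a separator and never part of a value. The sample query string is invented for illustration.

```python
from urllib.parse import parse_qsl

legacy_query = "tag=a;tag=b;lang=py"

# Assumption: every ';' in this input is a separator, never part of a value.
normalized_query = legacy_query.replace(";", "&")

pairs = parse_qsl(normalized_query)
print(pairs)  # [('tag', 'a'), ('tag', 'b'), ('lang', 'py')]
```

The normalized string can then be passed to URLSearchParams in the same way.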
Query String Building
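A minimal sketch of the urlencode mapping from the Quick Reference table. The urllib.parse half is standard library; the pywhatwgurl half runs only if the package is installed and assumes, per that table, that str(URLSearchParams(query)) accepts a dict or a list of tuples. No output is shown for the pywhatwgurl calls because it has not been verified here.

```python
from urllib.parse import urlencode

# A dict and a list of tuples both work; the list form keeps repeated keys.
from_dict = urlencode({"q": "hello world", "lang": "py"})
from_pairs = urlencode([("tag", "a"), ("tag", "b")])
print(from_dict)   # q=hello+world&lang=py
print(from_pairs)  # tag=a&tag=b

try:
    from pywhatwgurl import URLSearchParams  # third-party; optional here
except ImportError:
    URLSearchParams = None

if URLSearchParams is not None:
    # Per the Quick Reference table, str() serializes the parameters.
    print(str(URLSearchParams({"q": "hello world", "lang": "py"})))
    print(str(URLSearchParams([("tag", "a"), ("tag", "b")])))
```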
Percent-Encoding
urllib.parse provides explicit quote() / unquote() functions. pywhatwgurl encodes automatically when you set URL properties, using the correct percent-encode set for each component.
For standalone encoding, use percent_encode_after_encoding:
from pywhatwgurl import percent_encode_after_encoding
result = percent_encode_after_encoding("hello world")
print(result) # "hello%20world"
Encoding Differences
urllib.parse.quote uses a safe parameter to specify characters that should not be encoded. WHATWG uses component-specific encode sets — the path, query, fragment, and userinfo components each have different rules for which characters require encoding. This means the same character may be encoded differently depending on where it appears.
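For comparison, this is what the safe parameter controls on the urllib.parse side. The sketch uses only the standard library; with pywhatwgurl the equivalent choice is made by the component being set rather than by an argument.

```python
from urllib.parse import quote, quote_plus

text = "a/b c&d"

default_quoted = quote(text)          # "/" is in the default safe set
strict_quoted = quote(text, safe="")  # nothing is exempt from encoding
plus_quoted = quote_plus(text)        # form style: space becomes "+"

print(default_quoted)  # a/b%20c%26d
print(strict_quoted)   # a%2Fb%20c%26d
print(plus_quoted)     # a%2Fb+c%26d
```

When migrating, each quote() call site needs a decision about which URL component the value belongs to, since that determines the encode set.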
Building URLs from Parts
Default Port Normalization
Notice that pywhatwgurl omits the default port (443 for HTTPS) during serialization, while urlunparse preserves whatever you pass in. This normalization is required by the WHATWG URL serializer.
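A sketch of building the same URL both ways, to make the port difference described above concrete. The urlunparse half is standard library. The pywhatwgurl half runs only if the package is installed and assumes that the properties shown under Parsing URLs (port, pathname, search, hash) are also settable; treat it as illustrative rather than as the library's documented recipe.

```python
from urllib.parse import urlunparse

# Tuple order: (scheme, netloc, path, params, query, fragment)
parts = ("https", "example.com:443", "/path", "", "q=1", "frag")
rebuilt = urlunparse(parts)
print(rebuilt)  # https://example.com:443/path?q=1#frag

try:
    from pywhatwgurl import URL  # third-party; optional here
except ImportError:
    URL = None

if URL is not None:
    # Assumption: the properties shown in "Parsing URLs" are also settable.
    url = URL("https://example.com")
    url.port = "443"
    url.pathname = "/path"
    url.search = "?q=1"
    url.hash = "#frag"
    print(url.href)  # expected to omit :443, per the note above
```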
Error Handling
from pywhatwgurl import URL
# URL() raises ValueError for invalid input
try:
    url = URL("not a url")
except ValueError:
    print("Invalid URL")
# Use URL.parse() for non-throwing behavior
url = URL.parse("not a url") # Returns None
# Use URL.can_parse() to check validity
if URL.can_parse("https://example.com"):
    url = URL("https://example.com")
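By contrast, urllib.parse rarely rejects input, so code being migrated may never have handled a parse failure. A small standard-library check shows what urlparse does with the same string:

```python
from urllib.parse import urlparse

# urlparse accepts this input and files the whole string under path.
result = urlparse("not a url")
print(repr(result.scheme))  # ''
print(repr(result.netloc))  # ''
print(result.path)          # not a url
```

Call sites that previously passed such strings through silently will need a try/except, URL.parse(), or URL.can_parse() after migration.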
Behavioral Differences
The following table summarizes how the same input is parsed differently:
| Input | urllib.parse | pywhatwgurl | Why |
|---|---|---|---|
| http://example.com\path | netloc = example.com\path, path = "" | pathname = /path | WHATWG normalizes \ to / in special schemes; urllib.parse leaves the backslash in netloc |
| HTTP://EXAMPLE.COM | scheme = http, netloc = EXAMPLE.COM | protocol = http:, hostname = example.com | WHATWG lowercases hosts |
| http://example.com:80/ | port = 80 | port = "" | WHATWG omits default ports |
| http://例え.jp | preserves Unicode (netloc = 例え.jp) | hostname = xn--r8jz45g.jp | WHATWG applies IDNA encoding |
| http:///path | netloc = "", path = /path | hostname = path, pathname = / | WHATWG skips extra slashes after a special scheme, so path is parsed as the host |
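The urllib.parse column can be checked with the standard library alone. This sketch covers three of the rows; running the same inputs through URL(...) is a quick way to confirm the pywhatwgurl column against your installed version.

```python
from urllib.parse import urlparse

upper = urlparse("HTTP://EXAMPLE.COM")
print(upper.scheme, upper.netloc, upper.hostname)  # http EXAMPLE.COM example.com

with_port = urlparse("http://example.com:80/")
print(with_port.port)  # 80

empty_authority = urlparse("http:///path")
print(repr(empty_authority.netloc), empty_authority.path)  # '' /path
```

Note that urlparse lowercases the scheme and the hostname attribute but leaves netloc as written.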
What Doesn't Map
Some urllib.parse concepts have no WHATWG equivalent:
- params (semicolons in paths): urlparse supports a params component separated by ; in the path. WHATWG does not recognize this; semicolons are treated as regular path characters.
- netloc as a single string: WHATWG tracks username, password, hostname, and port as separate fields. Use url.host for hostname:port.
- safe parameter for encoding: WHATWG uses fixed, component-specific encode sets. There is no way to customize which characters are encoded in the URL parser.
- Scheme-generic parsing: urllib.parse parses any scheme://... generically. WHATWG has special schemes (http, https, ftp, ws, wss, file) with specific rules, and treats non-special schemes differently (for example, URLs such as mailto:user@example.com get an opaque path and no authority).
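To check whether existing code depends on the params component, it helps to see exactly what urlparse splits off. A standard-library sketch, using an invented URL:

```python
from urllib.parse import urlparse, urlsplit

url = "http://example.com/path;type=a?q=1"

parsed = urlparse(url)
print(parsed.path, parsed.params)  # /path type=a

# urlsplit never separates params, which is closer to the WHATWG view.
split = urlsplit(url)
print(split.path)  # /path;type=a
```

Code that reads .params from urlparse results will need to split the path on ; itself after migrating.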
Common Patterns
Extracting the Domain
from pywhatwgurl import URL
url = URL("https://subdomain.example.com:8080/path")
print(url.hostname) # "subdomain.example.com"
Checking if a URL is Valid
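A sketch contrasting a common urllib.parse-era heuristic with the URL.can_parse() check shown under Error Handling. looks_like_url is a hypothetical helper written for this example, not part of either library, and the pywhatwgurl half runs only if the package is installed.

```python
from urllib.parse import urlparse


def looks_like_url(candidate):
    """Common urllib-era heuristic: require both a scheme and a netloc."""
    parts = urlparse(candidate)
    return bool(parts.scheme and parts.netloc)


print(looks_like_url("https://example.com"))  # True
print(looks_like_url("not a url"))            # False

try:
    from pywhatwgurl import URL  # third-party; optional here
except ImportError:
    URL = None

if URL is not None:
    # URL.can_parse() is the check shown under "Error Handling".
    print(URL.can_parse("https://example.com"))
    print(URL.can_parse("not a url"))
```

The two checks are not equivalent: the heuristic only inspects two fields, while can_parse runs the full parser.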
Normalizing a URL
from pywhatwgurl import URL
# WHATWG parsing + serialization normalizes the URL
normalized = URL("HTTP://Example.COM:80/a/../b").href
print(normalized) # "http://example.com/b"
Modifying Query Parameters
from pywhatwgurl import URL
url = URL("https://example.com/search?q=python&page=1")
url.search_params["page"] = "2"
url.search_params["sort"] = "date"
del url.search_params["q"]
print(url.href) # "https://example.com/search?page=2&sort=date"
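For reference when locating code to migrate, the same edit written against urllib.parse usually takes a parse, modify, re-encode round trip. A standard-library sketch:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

parts = urlsplit("https://example.com/search?q=python&page=1")

# dict() keeps only the last value per key, which is fine for this input.
query = dict(parse_qsl(parts.query))
query["page"] = "2"
query["sort"] = "date"
del query["q"]

rebuilt = parts._replace(query=urlencode(query)).geturl()
print(rebuilt)  # https://example.com/search?page=2&sort=date
```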
Further Reading
- URL Parsing — How pywhatwgurl parses URLs
- URL Components — All URL properties explained
- URLSearchParams — Query string manipulation
- WHATWG URL Standard — The complete specification