Web browser#
An easy method to check a website’s loaded resources is via your browser’s dev tools. The following screenshot shows Firefox and the network tab while loading this webpage. Additionally, I filtered for the string “font”.
As you can see, this homepage loads two fonts, and both come from the same host that serves the homepage itself. The browser does not need to connect to a third party to download them.
Because I am a nerd, I asked myself whether I could get the information from my browser’s network tab programmatically in a CLI, which would allow me to do some further processing with Unix tooling.
Writing a CLI tool#
Because modern webpages do a lot of work on the client side, it is not sufficient to perform a single GET on the URL, as this would not load all the resources a browser would. We need something that can run a browser. Luckily, this is a common task in frontend testing and there is a mature framework that does exactly that: Selenium. In particular, selenium-wire is a module that extends Selenium with access to the requests the browser makes.
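To illustrate why a single GET falls short, here is a minimal sketch that uses only the Python standard library. It collects the resource URLs that are referenced statically in an HTML document. Anything requested at runtime by JavaScript or from inside a stylesheet (a font, for example) never shows up in this list. The class name `StaticResourceParser` and the sample markup are made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StaticResourceParser(HTMLParser):
    """Collect URLs referenced by src/href attributes of resource tags."""

    # Tags and the attribute that may point to a resource a browser fetches.
    RESOURCE_ATTRS = {"script": "src", "img": "src", "link": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = self.RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page URL.
                self.resources.append(urljoin(self.base_url, value))


# Hypothetical markup standing in for the body of a GET response.
SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="/css/main.css">
<script src="https://cdn.example.org/app.js"></script>
</head><body><img src="logo.png"></body></html>
"""

parser = StaticResourceParser("https://example.com/blog/")
parser.feed(SAMPLE_HTML)
for url in parser.resources:
    print(url)
```

The parser sees the stylesheet, but not what the stylesheet itself pulls in, which is exactly the gap a real browser closes.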
Our use cases are:
- Run a headless browser that loads a webpage
- Get the resources loaded by that webpage
- Compare the domain of the webpage with each resource’s domain
Here is the little (hacky) script I came up with. It requires Firefox (my browser of choice), geckodriver (put it in your PATH), and selenium as well as selenium-wire (pip install --user selenium selenium-wire).
from seleniumwire import webdriver
from urllib.parse import urlparse
import argparse


def getDomain(s):
    return urlparse(s).netloc


def notHomePageInResource(homepage, res):
    # Substring check: a resource counts as first party if the homepage's
    # domain appears anywhere in the resource's domain.
    return getDomain(homepage) not in getDomain(res)


def getNetworkResources(homepage):
    options = webdriver.FirefoxOptions()
    # Pass the flag directly; the `options.headless` setter is deprecated
    # in newer Selenium 4 releases.
    options.add_argument("-headless")
    options.add_argument(
        "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
    )
    driver = webdriver.Firefox(options=options)  # use case 1
    try:
        driver.implicitly_wait(20)
        resources = []
        driver.get(homepage)
        # Skip requests Firefox makes on its own behalf, see
        # https://support.mozilla.org/en-US/questions/1251590
        excluded_resource = ["firefox.com", "mozilla.com", "mozilla.net"]
        for request in driver.requests:  # use case 2
            # Only keep requests that actually received a response.
            if request.response:
                if not any(e in request.url for e in excluded_resource):
                    resources.append(request.url)
        return set(resources)
    finally:
        # Always shut the browser down, even if loading the page fails.
        driver.quit()


def checkResources(homepage, resources):
    for res in resources:
        if notHomePageInResource(homepage, res):  # use case 3
            print(res)


def dumpResources(resources):
    for res in resources:
        print(res)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        prog="network-resources",
        description="Verify resources that webpages will load",
    )
    parser.add_argument("url", action="store", nargs="+", help="URL(s) to check")
    parser.add_argument(
        "-c",
        "--check",
        action="store_true",
        help="List resources loaded from third party domains",
    )
    args = parser.parse_args()
    # The first positional value is skipped: the invocations in this post
    # pass the literal word `url` before the actual URL(s).
    for u in args.url[1:]:
        r = getNetworkResources(u)
        if args.check:
            print(f"Checking {u}")
            checkResources(u, r)
            print(f"Finished with {u}\n")
        else:
            print(f"Dumping {u}")
            dumpResources(r)
            print(f"Finished with {u}\n")
Running python network-resources url "https://rootknecht.net" will list all resources a webpage loads:
Running python network-resources -c url "https://google.com" will check the domain of the homepage and the resources:
Here you can see that google.com loads some assets from domains other than google.com.
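The comparison in `notHomePageInResource` is a plain substring test, so a host such as `notgoogle.com` would also pass as first party for `google.com`. A slightly stricter variant could compare hostnames and accept only the same host or a subdomain of it. The following is a sketch, not part of the script above; the function name `is_third_party` is made up here, and it knows nothing about public suffixes such as `co.uk`.

```python
from urllib.parse import urlparse


def is_third_party(homepage, resource):
    """Return True if resource is served from outside homepage's domain."""
    page_host = (urlparse(homepage).hostname or "").lower()
    res_host = (urlparse(resource).hostname or "").lower()
    # Treat "www.example.com" and "example.com" as the same site.
    if page_host.startswith("www."):
        page_host = page_host[4:]
    # The same host, or a subdomain of it, counts as first party.
    return not (res_host == page_host or res_host.endswith("." + page_host))


print(is_third_party("https://google.com", "https://www.google.com/logo.png"))
print(is_third_party("https://google.com", "https://fonts.gstatic.com/a.woff2"))
print(is_third_party("https://google.com", "https://notgoogle.com/track.js"))
```

The three calls print `False`, `True`, and `True`. The last case is the one where the substring test would treat the resource as first party.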
You can also specify multiple URLs at once:
With this script I can check resources loaded by a webpage without leaving my beloved shell.
Limitations#
Modern webpages are complex and it is not easy for Selenium to decide whether a webpage is fully loaded. So this simple approach is not 100% accurate, but for a little toy project it was quite fun and satisfies my needs.
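One way to soften this limitation, sketched here under the assumption that "no new requests for a while" is a good enough signal for "loaded", is to poll the number of captured requests until it stops changing. The helper `wait_until_stable` and its parameters are made up for this illustration and are not part of Selenium or selenium-wire. It takes any callable, so it can be tried without a browser.

```python
import time


def wait_until_stable(get_count, quiet_period=2.0, timeout=30.0, poll=0.25):
    """Block until get_count() has not changed for quiet_period seconds.

    Returns True if the value settled, False if timeout was reached first.
    """
    deadline = time.monotonic() + timeout
    last_count = get_count()
    last_change = time.monotonic()
    while time.monotonic() < deadline:
        time.sleep(poll)
        current = get_count()
        now = time.monotonic()
        if current != last_count:
            # Something new arrived: restart the quiet period.
            last_count, last_change = current, now
        elif now - last_change >= quiet_period:
            return True
    return False


# Possible use in the script, right after driver.get(homepage):
#     wait_until_stable(lambda: len(driver.requests))
```

Pages that poll their backend forever will never become quiet, which is why the helper gives up after `timeout` seconds instead of blocking indefinitely.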