Build Your Website Downloader Program in 30 Minutes

Why build a website downloader?
A website downloader is a tool that fetches pages, assets (HTML, CSS, JavaScript, images), and optionally restructures them for offline viewing or analysis. Building one yourself teaches important web concepts—HTTP, parsing, concurrency, and respectful scraping practices—while giving you a customizable utility for backups, research, or offline demos.
What this guide covers
- A working Python downloader you can build in ~30 minutes
- Explanations of key parts: fetching, parsing, asset handling, and concurrency
- Safety, legal, and ethical considerations
- Tips for expanding and hardening the tool
Prerequisites
- Basic Python knowledge (functions, modules, I/O)
- Python 3.8+ installed
- pip for installing packages
We’ll use aiohttp with asyncio for concurrent HTTP requests, BeautifulSoup for HTML parsing, and aiofiles for asynchronous file writes (requests is optional but handy for quick synchronous tests). Install the needed libraries:
pip install requests beautifulsoup4 aiohttp aiofiles yarl
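If you want a quick sanity check that everything installed cleanly, a throwaway snippet like the one below will fail immediately with an ImportError if a package is missing (the file name and version printing are just illustrative):

```python
# check_install.py - optional sanity check that the required packages import cleanly
import aiohttp        # async HTTP client used by the downloader
import aiofiles       # async file I/O
import bs4            # BeautifulSoup lives in this package
import requests       # only needed for quick synchronous experiments

print("aiohttp", aiohttp.__version__)
print("beautifulsoup4", bs4.__version__)
print("requests", requests.__version__)
```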
Project structure
A small, clear structure keeps the tool maintainable:
- downloader/
- downloader.py (main program)
- utils.py (helper functions)
- templates/ (optional HTML templates)
- output/ (downloaded site)
For this 30-minute build we’ll put everything in one file, downloader.py.
The approach (high level)
- Fetch a page (HTTP GET).
- Parse HTML to find asset links (css, js, img, link[href], a[href]).
- Normalize and filter URLs (same domain or allowed).
- Download assets concurrently.
- Rewrite HTML to point to local asset paths.
- Save files to output folder.
- Optionally follow internal links to crawl the site (a minimal single-page sketch of these steps follows the list).
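Before wiring in concurrency, it helps to see the fetch, parse, and save steps on a single page. Here is a minimal synchronous sketch using requests and BeautifulSoup; the function name, output file, and example URL are just illustrative:

```python
# single_page_sketch.py - illustrative only: fetch one page, list its assets, save the HTML
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def fetch_page_assets(url):
    # Fetch the page
    resp = requests.get(url, headers={"User-Agent": "SimpleSiteDownloader/0.1"}, timeout=30)
    resp.raise_for_status()

    # Parse the HTML and collect absolute asset URLs
    soup = BeautifulSoup(resp.text, "html.parser")
    assets = []
    for img in soup.find_all("img", src=True):
        assets.append(urljoin(url, img["src"]))
    for script in soup.find_all("script", src=True):
        assets.append(urljoin(url, script["src"]))
    for link in soup.find_all("link", href=True):
        assets.append(urljoin(url, link["href"]))

    # Save the raw HTML locally (no rewriting yet)
    with open("index.html", "w", encoding="utf-8") as f:
        f.write(resp.text)
    return assets

if __name__ == "__main__":
    for asset in fetch_page_assets("https://example.com"):
        print(asset)
```

The full script below does the same work asynchronously and also rewrites the HTML so the saved pages point at the downloaded assets.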
downloader.py — the code
```python
#!/usr/bin/env python3
"""
Simple Website Downloader
Usage: python downloader.py https://example.com output_dir
"""
import asyncio
import os
import sys
import logging
from urllib.parse import urljoin, urlparse, urldefrag

import aiohttp
import aiofiles
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("downloader")

MAX_CONCURRENT = 10
USER_AGENT = "SimpleSiteDownloader/1.0 (+https://example.com)"
sem = asyncio.Semaphore(MAX_CONCURRENT)


def ensure_dir(path):
    os.makedirs(path, exist_ok=True)


def same_domain(base, url):
    try:
        return urlparse(base).netloc == urlparse(url).netloc
    except Exception:
        return False


def make_local_path(base_url, target_url, outdir):
    """Map a URL to a file path under outdir, mirroring the site's path structure."""
    p = urlparse(target_url).path
    if p.endswith("/"):
        p = p + "index.html"
    if p == "" or p == "/":
        p = "/index.html"
    local_path = os.path.join(outdir, p.lstrip("/"))
    ensure_dir(os.path.dirname(local_path))
    return local_path


async def fetch(session, url):
    """GET a URL under the concurrency semaphore; returns (status, content-type, bytes)."""
    headers = {"User-Agent": USER_AGENT}
    async with sem:
        try:
            timeout = aiohttp.ClientTimeout(total=30)
            async with session.get(url, headers=headers, timeout=timeout) as resp:
                content = await resp.read()
                return resp.status, resp.headers.get("Content-Type", ""), content
        except Exception as e:
            logger.warning(f"Failed fetch {url}: {e}")
            return None, None, None


async def save_bytes(path, data):
    async with aiofiles.open(path, "wb") as f:
        await f.write(data)


def extract_links(html, base_url):
    """Return the parsed soup plus (attribute, url, tag) triples for assets and links."""
    soup = BeautifulSoup(html, "html.parser")
    tags = []
    # assets
    for tag in soup.find_all(["img", "script", "link"]):
        if tag.name == "img" and tag.get("src"):
            tags.append(("src", tag["src"], tag))
        if tag.name == "script" and tag.get("src"):
            tags.append(("src", tag["src"], tag))
        if tag.name == "link" and tag.get("href") and tag.get("rel") and "stylesheet" in tag.get("rel"):
            tags.append(("href", tag["href"], tag))
    # internal pages
    for a in soup.find_all("a", href=True):
        tags.append(("href", a["href"], a))
    return soup, tags


async def download_page(session, url, base_url, outdir, seen, queue):
    status, ctype, content = await fetch(session, url)
    if status is None or status >= 400 or content is None:
        logger.info(f"Skipping {url} (status={status})")
        return
    # non-HTML responses (e.g. PDFs reached via <a href>) are saved as raw bytes
    if ctype and "text/html" not in ctype:
        path = make_local_path(base_url, url, outdir)
        await save_bytes(path, content)
        logger.info(f"Saved non-HTML content {url} -> {path}")
        return
    html = content.decode("utf-8", errors="ignore")
    soup, tags = extract_links(html, base_url)
    for attr, link, tag in tags:
        link_clean, _ = urldefrag(urljoin(url, link))
        if not link_clean.startswith(("http://", "https://")):
            continue
        # assets: download and rewrite the tag to a local relative path
        if tag.name in ("img", "script", "link"):
            local_path = make_local_path(base_url, link_clean, outdir)
            rel_path = os.path.relpath(
                local_path,
                start=os.path.dirname(make_local_path(base_url, url, outdir)),
            )
            tag[attr] = rel_path.replace(os.sep, "/")
            if link_clean not in seen:
                seen.add(link_clean)
                queue.append(asyncio.create_task(download_asset(session, link_clean, local_path)))
        # pages: queue for crawling if on the same domain
        elif tag.name == "a":
            if same_domain(base_url, link_clean) and link_clean not in seen:
                seen.add(link_clean)
                queue.append(asyncio.create_task(
                    download_page(session, link_clean, base_url, outdir, seen, queue)))
    # save the rewritten HTML
    outpath = make_local_path(base_url, url, outdir)
    async with aiofiles.open(outpath, "w", encoding="utf-8") as f:
        await f.write(str(soup))
    logger.info(f"Saved page {url} -> {outpath}")


async def download_asset(session, url, local_path):
    status, ctype, content = await fetch(session, url)
    if status is None or status >= 400 or content is None:
        logger.info(f"Skipping asset {url} (status={status})")
        return
    await save_bytes(local_path, content)
    logger.info(f"Saved asset {url} -> {local_path}")


async def main(start_url, outdir):
    ensure_dir(outdir)
    # ssl=False skips certificate verification; remove it for strict checking
    connector = aiohttp.TCPConnector(limit_per_host=MAX_CONCURRENT, ssl=False)
    async with aiohttp.ClientSession(connector=connector) as session:
        seen = set()
        queue = []
        seen.add(start_url)
        queue.append(asyncio.create_task(download_page(session, start_url, start_url, outdir, seen, queue)))
        # simple task-run loop: tasks may append more tasks while we drain the list
        while queue:
            task = queue.pop(0)
            await task


if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python downloader.py <start_url> <output_dir>")
        sys.exit(1)
    start, out = sys.argv[1], sys.argv[2]
    asyncio.run(main(start, out))
```
How it works (brief)
- Uses asyncio + aiohttp for concurrent downloads.
- Parses HTML with BeautifulSoup, finds assets and internal links.
- Rewrites asset URLs to local relative paths.
- Saves pages and assets mirroring the site’s path structure.
- Limits concurrency with a semaphore.
Safety, ethics, and legality
- Respect robots.txt and the website’s terms. This script does not check robots.txt — add that for politeness.
- Rate-limit and avoid hammering smaller sites. Use delays or fewer concurrent connections (a minimal throttling sketch follows this list).
- Don’t download private or copyrighted material without permission.
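As one concrete way to throttle requests, the sketch below is a minimal, self-contained variant of the fetch() coroutine; the delay and connection limit are illustrative values to tune per site. Sleeping while still holding the semaphore caps the overall request rate, not just the number of parallel connections:

```python
# polite_fetch.py - illustrative sketch: per-request delay on top of the concurrency semaphore
import asyncio
import aiohttp

MAX_CONCURRENT = 3      # fewer parallel connections than the main script
REQUEST_DELAY = 1.0     # seconds to pause after each request; tune per site

sem = asyncio.Semaphore(MAX_CONCURRENT)

async def polite_fetch(session, url):
    async with sem:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            data = await resp.read()
        # still inside the semaphore, so the pause limits the overall request rate
        await asyncio.sleep(REQUEST_DELAY)
        return resp.status, data

async def main():
    async with aiohttp.ClientSession() as session:
        status, data = await polite_fetch(session, "https://example.com")
        print(status, len(data))

if __name__ == "__main__":
    asyncio.run(main())
```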
Next steps / improvements
- Add robots.txt parsing (use robotparser); a minimal sketch follows this list.
- Respect Content-Security-Policy when deciding what to rewrite.
- Support query strings and unique filenames for assets with same path but different queries.
- Add CLI flags: max depth, rate-limit, include/exclude patterns, user-agent.
- Add logging to a file and resume capability.
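For the robots.txt item above, the standard library's urllib.robotparser is enough to answer "may I fetch this URL?" before queueing it. A minimal sketch follows; the helper names are illustrative, and wiring it into the async crawl loop is left as an exercise:

```python
# robots_check.py - illustrative sketch using the standard library's robotparser
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse, urljoin

USER_AGENT = "SimpleSiteDownloader/1.0"

def build_robot_parser(start_url):
    # robots.txt always lives at the site root
    root = "{0.scheme}://{0.netloc}".format(urlparse(start_url))
    rp = RobotFileParser()
    rp.set_url(urljoin(root, "/robots.txt"))
    rp.read()  # fetches and parses robots.txt (synchronously)
    return rp

def allowed(rp, url):
    return rp.can_fetch(USER_AGENT, url)

if __name__ == "__main__":
    rp = build_robot_parser("https://example.com/")
    print(allowed(rp, "https://example.com/some/page"))
```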
This script gives a minimal, usable website downloader you can enhance to fit production needs.