# White House News Scraper
A robust, modular Python-based web scraper designed to extract news articles from whitehouse.gov/news.
## Project Logic & Workflow
The scraper follows a structured workflow to ensure data integrity, efficiency, and reusability. By separating the orchestration from the utility functions, the project remains easy to maintain and extend.
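The snippets in the sections below reference a handful of module-level constants (`OUTPUT_FOLDER`, `TRACKING_FILE`, `HEADERS`). Their exact values are not shown in this document, so the sketch below uses illustrative placeholders:

```python
import csv
import json
import os

# Third-party imports used by the snippets below (installed separately):
#   import requests
#   from bs4 import BeautifulSoup
#   from slugify import slugify  # e.g. from the python-slugify package

# Illustrative values; the real paths and User-Agent string are assumptions.
OUTPUT_FOLDER = 'articles'
TRACKING_FILE = 'article_tracking.csv'
HEADERS = {'User-Agent': 'Mozilla/5.0 (compatible; news-scraper/1.0)'}
```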
### 1. Initialization
The process begins by ensuring the environment is ready. The init_storage function creates necessary directories and initializes the tracking CSV with the correct headers. It also handles migrations for existing CSV files.
```python
def init_storage():
    """Initializes the output folder and tracking CSV if they don't exist."""
    if not os.path.exists(OUTPUT_FOLDER):
        os.makedirs(OUTPUT_FOLDER)
    headers = ['date_created', 'date_collected', 'article_name', 'category']
    # If the file exists, we verify headers and migrate if necessary;
    # otherwise, we create a fresh one.
    if not os.path.exists(TRACKING_FILE):
        with open(TRACKING_FILE, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(headers)
```
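The migration branch mentioned above is elided from the snippet. A minimal sketch of what header verification could look like (the function name and behavior here are assumptions, not the project's actual migration code):

```python
import csv
import os

def migrate_tracking_headers(path, expected_headers):
    """Rewrite the tracking CSV with the expected header row if it differs.

    Existing data rows are preserved; columns absent from the old file
    are left empty. Returns True if the file was rewritten.
    """
    if not os.path.exists(path):
        return False
    with open(path, 'r', newline='', encoding='utf-8') as f:
        rows = list(csv.reader(f))
    if rows and rows[0] == expected_headers:
        return False  # already up to date
    old_headers = rows[0] if rows else []
    data_rows = rows[1:] if rows else []
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=expected_headers)
        writer.writeheader()
        for row in data_rows:
            record = dict(zip(old_headers, row))
            writer.writerow({h: record.get(h, '') for h in expected_headers})
    return True
```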
### 2. Scraping Article Links
On any news catalog page, the scraper identifies individual article blocks and extracts the title, link, date, and category.
```python
def get_article_links(url):
    """Scrapes article metadata from a list page."""
    response = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(response.text, 'html.parser')
    articles = soup.select('li.wp-block-post')
    data = []
    for article in articles:
        title_tag = article.select_one('.wp-block-post-title a')
        meta = article.select_one('.wp-block-whitehouse-post-template__meta')
        if title_tag is None or meta is None:
            continue  # skip blocks that don't match the expected layout
        data.append({
            'title': title_tag.get_text(strip=True),
            'link': title_tag['href'],
            'date': meta.find('time').get_text(strip=True),
            'category': meta.select_one('.taxonomy-category a').get_text(strip=True)
        })
    return data
```
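The selector logic can be exercised offline against a static HTML fragment. The markup below is a simplified assumption about the page structure, not a verbatim copy of whitehouse.gov:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mirroring the class names the scraper targets.
SAMPLE_HTML = """
<ul>
  <li class="wp-block-post">
    <h2 class="wp-block-post-title"><a href="https://example.com/a1">Sample Title</a></h2>
    <div class="wp-block-whitehouse-post-template__meta">
      <time datetime="2026-01-20">January 20, 2026</time>
      <span class="taxonomy-category"><a href="#">Briefings</a></span>
    </div>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
article = soup.select('li.wp-block-post')[0]
title_tag = article.select_one('.wp-block-post-title a')
meta = article.select_one('.wp-block-whitehouse-post-template__meta')
record = {
    'title': title_tag.get_text(strip=True),
    'link': title_tag['href'],
    'date': meta.find('time').get_text(strip=True),
    'category': meta.select_one('.taxonomy-category a').get_text(strip=True),
}
```

Running this kind of fixture as a unit test catches selector breakage without hitting the network.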
### 3. Protecting Against Duplicates
To prevent redundant network requests and duplicate files, the scraper checks every article title against a local tracking file (`article_tracking.csv`) before proceeding to the full-page scrape.
```python
def is_already_collected(title):
    """Checks if an article with the given title has already been collected."""
    if not os.path.exists(TRACKING_FILE):
        return False
    with open(TRACKING_FILE, 'r', newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            if row['article_name'] == title:
                return True
    return False
```
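Note that this function re-reads the CSV on every call, so a run over many pages does O(n) file scans. When that becomes a bottleneck, one option (a sketch, not part of the current codebase) is to load the titles into a set once per run:

```python
import csv
import os

def load_collected_titles(path):
    """Load previously collected article names into a set for O(1) lookups."""
    if not os.path.exists(path):
        return set()
    with open(path, 'r', newline='', encoding='utf-8') as f:
        return {row['article_name'] for row in csv.DictReader(f)}
```

The orchestrator would then test `article['title'] in collected` instead of calling `is_already_collected` per article.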
### 4. Fetching Full Content
For new articles, the scraper navigates to the individual article page and extracts the clean body text, removing navigation and layout elements.
```python
def get_article_content(url):
    """Scrapes the main text content of an individual article."""
    response = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(response.text, 'html.parser')
    content_div = soup.select_one('.wp-block-post-content') or soup.select_one('article')
    return content_div.get_text(separator='\n', strip=True) if content_div else ""
```
### 5. Data Capture & Storage
Finally, the data is saved as a JSON file named after the article's slug, and the central tracking CSV is updated to prevent future re-scraping.
```python
def save_article(article_data):
    """Saves article data to a JSON file and updates tracking CSV."""
    filename = f"{slugify(article_data['title'][:50])}.json"
    filepath = os.path.join(OUTPUT_FOLDER, filename)
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(article_data, f, indent=4, ensure_ascii=False)
    update_tracking_csv(article_data['title'], article_data['date'], article_data['category'])
```
### 6. Multi-Page Orchestration
All the above steps are managed by the orchestration function, which provides a high-level API for scraping ranges of pages.
```python
def scrape_news_pages(start_page=1, end_page=1):
    """Orchestrates scraping across a range of pages."""
    init_storage()
    for page_num in range(start_page, end_page + 1):
        url = f".../news/page/{page_num}/" if page_num > 1 else ".../news/"
        articles = get_article_links(url)
        for article in articles:
            if is_already_collected(article['title']):
                continue
            article['content'] = get_article_content(article['link'])
            save_article(article)
```
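A small command-line entry point makes the orchestrator convenient to run directly. The flag names below are illustrative, not part of the source:

```python
import argparse

def build_arg_parser():
    """Build the CLI parser for selecting a page range to scrape."""
    parser = argparse.ArgumentParser(description='Scrape whitehouse.gov/news articles.')
    parser.add_argument('--start-page', type=int, default=1,
                        help='First list page to scrape')
    parser.add_argument('--end-page', type=int, default=1,
                        help='Last list page to scrape (inclusive)')
    return parser

# Typical invocation from the scraper module:
#   args = build_arg_parser().parse_args()
#   scrape_news_pages(start_page=args.start_page, end_page=args.end_page)
```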
## Summary of Design
The scraper’s architecture prioritizes minimizing network overhead by performing local checks before every external request. Beyond `requests`, `BeautifulSoup`, and a slugify helper, it relies only on the Python standard library, keeping it lightweight and portable across environments.
## TODO
- Add error handling for connection timeouts
- Support multi-threaded scraping
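As a starting point for the first TODO item, requests sessions can be configured with bounded retries, paired with a per-request timeout at each call site. This is a sketch with illustrative parameter values, not the project's implementation:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session(total_retries=3, backoff_factor=0.5):
    """Create a requests session that retries transient failures with backoff."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,
        status_forcelist=(429, 500, 502, 503, 504),  # retry on these statuses
        allowed_methods=('GET',),                    # only retry idempotent GETs
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('https://', adapter)
    session.mount('http://', adapter)
    return session

# Call sites would then use, e.g.:
#   session = build_session()
#   response = session.get(url, headers=HEADERS, timeout=10)
```

Note that `Retry(allowed_methods=...)` requires urllib3 >= 1.26; older versions use the deprecated `method_whitelist` keyword.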
Last Updated: 2026-01-27