WebContent Integrationยถ
CellMage provides a powerful integration for fetching and processing website content, allowing you to extract information from web pages directly into your Jupyter notebooks and use it as context for your LLM prompts.
Installationยถ
To use the WebContent integration, install CellMage with the webcontent extra:
pip install "cellmage[webcontent]"
This will install the necessary dependencies:
requests: For HTTP requestsbeautifulsoup4: For HTML parsingmarkdownify: For HTML to Markdown conversiontrafilatura: For advanced content extraction
You can also install these dependencies separately:
pip install requests beautifulsoup4 markdownify trafilatura
Basic Usageยถ
Load the extension in your Jupyter notebook:
%load_ext cellmage.magic_commands.tools.webcontent_magic
This will register the %webcontent magic command.
To fetch content from a website:
%webcontent https://example.com
This will automatically:
Fetch the websiteโs HTML content
Clean and extract the main content (removing navigation, ads, etc.)
Convert the content to Markdown format
Add it to your conversation history as a user message, making it available as context for your LLM prompts
Advanced Usageยถ
Content Extraction Methodsยถ
The WebContent integration offers three different methods for extracting content from websites:
trafilatura (default): Uses the Trafilatura library, which is specifically designed for content extraction and works well on most websites
bs4: Uses BeautifulSoup to identify and extract the main content area of the page
simple: Simply converts the entire HTML to Markdown with minimal cleaning
You can specify which method to use:
# Use BeautifulSoup for extraction
%webcontent https://example.com --method bs4
# Use simple extraction
%webcontent https://example.com --method simple
Raw HTML Optionยถ
If you prefer to get the raw HTML content without cleaning or extraction:
%webcontent https://example.com --raw
Media and Link Optionsยถ
You can control how images and links are handled:
# Include image references in the output
%webcontent https://example.com --include-images
# Remove hyperlinks from the output
%webcontent https://example.com --no-links
Network Optionsยถ
Adjust the request timeout if needed:
# Set a custom timeout in seconds
%webcontent https://example.com --timeout 60
Command Optionsยถ
Option |
Description |
|---|---|
|
Add the content as a system message instead of a user message |
|
Just display the content without adding it to conversation history |
|
Clean and extract main content (default behavior) |
|
Get raw HTML content without cleaning |
|
Content extraction method: trafilatura (default), bs4, or simple |
|
Include image references in the output |
|
Remove hyperlinks from the output |
|
Request timeout in seconds (default: 30) |
Examplesยถ
Fetch website content with default settings (clean content as Markdown):
%webcontent https://example.com
Fetch raw HTML content:
%webcontent https://example.com --raw
Fetch content and add it as system context:
%webcontent https://example.com --system
Just display content without adding to history:
%webcontent https://example.com --show
Use BeautifulSoup for extraction and include images:
%webcontent https://example.com --method bs4 --include-images
Using WebContent with LLM Queriesยถ
After fetching website content, you can reference it in your LLM prompts:
# First, fetch the website content
%webcontent https://example.com
# Then, ask the LLM about it
%%llm
Summarize the key points from the website content above.
You can also combine WebContent with other integrations:
# Fetch project documentation from a website
%webcontent https://example.com/docs --system
# Fetch a GitHub repository
%github username/repo
# Ask LLM to analyze both
%%llm
Compare the repository code with the documentation website.
Are there any inconsistencies or missing features?
Troubleshootingยถ
Connection Issuesยถ
Connection errors:
Check if the website is accessible in your browser
Verify your network connection
Some websites might be blocking requests from scripts or have rate limiting
Try increasing the timeout:
%webcontent https://example.com --timeout 60
SSL/TLS errors:
Some websites might have certificate issues
If you trust the website, consult the requests library documentation for SSL verification options
Content Extraction Issuesยถ
Poor content extraction:
Try a different extraction method:
%webcontent https://example.com --method bs4If all extraction methods fail, try raw mode:
%webcontent https://example.com --rawSome websites use complex JavaScript to render content, which might not be accessible through simple HTTP requests
Missing images or links:
By default, images are excluded. Use
--include-imagesto include themCheck if the links are relative paths, which might not work out of context
Very large content:
Some websites have a lot of content which might increase token usage with your LLM
Consider using more specific URLs that point to specific pages or sections
For persistent issues, enable debug logging:
import logging
from cellmage.utils.logging import setup_logging
setup_logging(level=logging.DEBUG)
# The logs will be written to cellmage.log in your working directory