---
title: "Web Crawler Connector"
description: "Crawl and sync websites automatically with scheduled recrawling and robots.txt compliance"
icon: "globe"
---

Connect websites to automatically crawl and sync web pages into your Supermemory knowledge base. The web crawler respects robots.txt rules, includes SSRF protection, and automatically recrawls sites on a schedule.

The web crawler connector requires a **Scale Plan** or **Enterprise Plan**.

## Quick Setup

### 1. Create Web Crawler Connection

```typescript
import Supermemory from 'supermemory';

const client = new Supermemory({
  apiKey: process.env.SUPERMEMORY_API_KEY!
});

const connection = await client.connections.create('web-crawler', {
  redirectUrl: 'https://yourapp.com/callback',
  containerTags: ['user-123', 'website-sync'],
  documentLimit: 5000,
  metadata: {
    startUrl: 'https://docs.example.com'
  }
});

// Web crawler doesn't require OAuth - connection is ready immediately
console.log('Connection ID:', connection.id);
console.log('Connection created:', connection.createdAt);
// Note: connection.authLink is undefined for web-crawler
```

```python
from supermemory import Supermemory
import os

client = Supermemory(api_key=os.environ.get("SUPERMEMORY_API_KEY"))

connection = client.connections.create(
    'web-crawler',
    redirect_url='https://yourapp.com/callback',
    container_tags=['user-123', 'website-sync'],
    document_limit=5000,
    metadata={
        'startUrl': 'https://docs.example.com'
    }
)

# Web crawler doesn't require OAuth - connection is ready immediately
print(f'Connection ID: {connection.id}')
print(f'Connection created: {connection.created_at}')
# Note: connection.auth_link is None for web-crawler
```

```bash
curl -X POST "https://api.supermemory.ai/v3/connections/web-crawler" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "redirectUrl": "https://yourapp.com/callback",
    "containerTags": ["user-123", "website-sync"],
    "documentLimit": 5000,
    "metadata": {
      "startUrl":
        "https://docs.example.com"
    }
  }'

# Response:
# {
#   "id": "conn_wc123",
#   "redirectsTo": "https://yourapp.com/callback",
#   "authLink": null,
#   "expiresIn": null
# }
```

### 2. Connection Established

Unlike other connectors, the web crawler doesn't require OAuth authentication. The connection is established immediately upon creation, and crawling begins automatically.

### 3. Monitor Sync Progress

```typescript
// Check connection details
const connection = await client.connections.getByTags('web-crawler', {
  containerTags: ['user-123', 'website-sync']
});

console.log('Start URL:', connection.metadata?.startUrl);
console.log('Connection created:', connection.createdAt);

// List synced web pages
const documents = await client.connections.listDocuments('web-crawler', {
  containerTags: ['user-123', 'website-sync']
});

console.log(`Synced ${documents.length} web pages`);
```

```python
# Check connection details
connection = client.connections.get_by_tags(
    'web-crawler',
    container_tags=['user-123', 'website-sync']
)

print(f'Start URL: {connection.metadata.get("startUrl")}')
print(f'Connection created: {connection.created_at}')

# List synced web pages
documents = client.connections.list_documents(
    'web-crawler',
    container_tags=['user-123', 'website-sync']
)

print(f'Synced {len(documents)} web pages')
```

```bash
# Get connection details by provider and tags
curl -X POST "https://api.supermemory.ai/v3/connections/web-crawler/connection" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"containerTags": ["user-123", "website-sync"]}'

# Response includes connection details:
# {
#   "id": "conn_wc123",
#   "provider": "web-crawler",
#   "createdAt": "2024-01-15T10:00:00Z",
#   "documentLimit": 5000,
#   "metadata": {"startUrl": "https://docs.example.com", ...}
# }

# List synced documents
curl -X POST "https://api.supermemory.ai/v3/connections/web-crawler/documents" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"containerTags": ["user-123", "website-sync"]}'

# Response: Array of document objects
# [
#   {"title": "Home Page", "type": "webpage", "status": "done", "url": "https://docs.example.com"},
#   {"title": "Getting Started", "type": "webpage", "status": "done", "url": "https://docs.example.com/getting-started"}
# ]
```

## Supported Content Types

### Web Pages

- **HTML content** extracted and converted to markdown
- **Same-domain crawling** only (respects hostname boundaries)
- **Robots.txt compliance** - respects disallow rules
- **Content filtering** - only HTML pages (skips non-HTML content)

### URL Requirements

The web crawler only processes valid public URLs:

- Must be a public URL (not localhost, private IPs, or internal domains)
- Must be accessible from the internet
- Must return HTML content (non-HTML files are skipped)

## Sync Mechanism

The web crawler uses **scheduled recrawling** rather than real-time webhooks:

- **Initial Crawl**: begins immediately after connection creation
- **Scheduled Recrawling**: automatically recrawls sites that haven't been synced in 7+ days
- **No Real-time Updates**: unlike other connectors, the web crawler doesn't support webhook-based real-time sync

The recrawl schedule is automatically assigned when the connection is created. Sites are recrawled periodically to keep content up to date, but updates are not instantaneous.
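The 7+ day rule above can be sketched as a simple "is a recrawl due?" check. This is an illustrative helper, not part of the Supermemory SDK; it only assumes the ISO 8601 timestamp strings shown in the API responses (e.g. `"2024-01-15T10:00:00Z"`).

```python
from datetime import datetime, timedelta, timezone

# Threshold from the docs: sites not synced in 7+ days are recrawled
RECRAWL_AFTER = timedelta(days=7)

def is_recrawl_due(last_synced_at, now=None):
    """Illustrative check mirroring the scheduler's rule: a recrawl is due
    when the last sync is 7 or more days old. `last_synced_at` is an
    ISO 8601 string as returned by the API."""
    last = datetime.fromisoformat(last_synced_at.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return now - last >= RECRAWL_AFTER

now = datetime(2024, 1, 23, tzinfo=timezone.utc)
print(is_recrawl_due("2024-01-15T10:00:00Z", now))  # True  (about 7.5 days old)
print(is_recrawl_due("2024-01-20T10:00:00Z", now))  # False (about 2.5 days old)
```

The actual scheduler runs server-side; a client-side check like this is only useful for estimating when content will next refresh.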
## Connection Management

### List All Connections

```typescript
// List all web crawler connections
const connections = await client.connections.list({
  containerTags: ['user-123']
});

const webCrawlerConnections = connections.filter(
  conn => conn.provider === 'web-crawler'
);

webCrawlerConnections.forEach(conn => {
  console.log(`Start URL: ${conn.metadata?.startUrl}`);
  console.log(`Connection ID: ${conn.id}`);
  console.log(`Created: ${conn.createdAt}`);
});
```

```python
# List all web crawler connections
connections = client.connections.list(container_tags=['user-123'])

web_crawler_connections = [
    conn for conn in connections if conn.provider == 'web-crawler'
]

for conn in web_crawler_connections:
    print(f'Start URL: {conn.metadata.get("startUrl")}')
    print(f'Connection ID: {conn.id}')
    print(f'Created: {conn.created_at}')
```

```bash
# List all connections
curl -X POST "https://api.supermemory.ai/v3/connections/list" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"containerTags": ["user-123"]}'

# Response:
# [
#   {
#     "id": "conn_wc123",
#     "provider": "web-crawler",
#     "createdAt": "2024-01-15T10:30:00.000Z",
#     "documentLimit": 5000,
#     "metadata": {"startUrl": "https://docs.example.com", ...}
#   }
# ]
```

### Delete Connection

Remove a web crawler connection when no longer needed:

```typescript
// Delete by connection ID
const result = await client.connections.delete('connection_id_123');
console.log('Deleted connection:', result.id);

// Delete by provider and container tags
const providerResult = await client.connections.deleteByProvider('web-crawler', {
  containerTags: ['user-123']
});
console.log('Deleted web crawler connection for user');
```

```python
# Delete by connection ID
result = client.connections.delete('connection_id_123')
print(f'Deleted connection: {result.id}')

# Delete by provider and container tags
provider_result = client.connections.delete_by_provider(
    'web-crawler',
    container_tags=['user-123']
)
print('Deleted web crawler connection for user')
```

```bash
# Delete by connection ID
curl -X DELETE "https://api.supermemory.ai/v3/connections/connection_id_123" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY"

# Delete by provider and container tags
curl -X DELETE "https://api.supermemory.ai/v3/connections/web-crawler" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"containerTags": ["user-123"]}'
```

Deleting a connection will:

- Stop all future crawls of the website
- Keep existing synced documents in Supermemory (they won't be deleted)
- Remove the connection configuration

## Advanced Configuration

### Content Filtering

Control which web pages get synced using the settings API:

```typescript
// Configure intelligent filtering for web content
await client.settings.update({
  shouldLLMFilter: true,
  includeItems: {
    urlPatterns: ['*docs*', '*documentation*', '*guide*'],
    titlePatterns: ['*Getting Started*', '*API Reference*', '*Tutorial*']
  },
  excludeItems: {
    urlPatterns: ['*admin*', '*private*', '*test*'],
    titlePatterns: ['*Draft*', '*Archive*', '*Old*']
  },
  filterPrompt: "Sync documentation pages, guides, and API references. Skip admin pages, private content, drafts, and archived pages."
});
```

```python
# Configure intelligent filtering for web content
client.settings.update(
    should_llm_filter=True,
    include_items={
        'urlPatterns': ['*docs*', '*documentation*', '*guide*'],
        'titlePatterns': ['*Getting Started*', '*API Reference*', '*Tutorial*']
    },
    exclude_items={
        'urlPatterns': ['*admin*', '*private*', '*test*'],
        'titlePatterns': ['*Draft*', '*Archive*', '*Old*']
    },
    filter_prompt="Sync documentation pages, guides, and API references. Skip admin pages, private content, drafts, and archived pages."
)
```

```bash
# Configure intelligent filtering for web content
curl -X PATCH "https://api.supermemory.ai/v3/settings" \
  -H "Authorization: Bearer $SUPERMEMORY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "shouldLLMFilter": true,
    "includeItems": {
      "urlPatterns": ["*docs*", "*documentation*", "*guide*"],
      "titlePatterns": ["*Getting Started*", "*API Reference*", "*Tutorial*"]
    },
    "excludeItems": {
      "urlPatterns": ["*admin*", "*private*", "*test*"],
      "titlePatterns": ["*Draft*", "*Archive*", "*Old*"]
    },
    "filterPrompt": "Sync documentation pages, guides, and API references. Skip admin pages, private content, drafts, and archived pages."
  }'
```

## Security & Compliance

### SSRF Protection

Built-in protection against Server-Side Request Forgery (SSRF) attacks:

- Blocks private IP addresses (10.x.x.x, 192.168.x.x, 172.16-31.x.x)
- Blocks localhost and internal domains
- Blocks cloud metadata endpoints
- Only allows public, internet-accessible URLs

### URL Validation

All URLs are validated before crawling:

- Must be valid HTTP/HTTPS URLs
- Must be publicly accessible
- Must return HTML content
- Response size is limited to 10 MB

**Important Limitations:**

- Requires a Scale Plan or Enterprise Plan
- Only crawls same-domain URLs
- Scheduled recrawling means updates are not real-time
- Large websites may take significant time to crawl initially
- Robots.txt restrictions may prevent crawling some pages
- URLs must be publicly accessible (no authentication required)
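As a rough illustration of the SSRF rules above, a client-side pre-flight check on a start URL might look like the sketch below. This is not the crawler's actual implementation: the real service also resolves hostnames to IP addresses before checking (a literal-IP check alone cannot cover DNS-based bypasses), and the internal-domain suffixes here are illustrative assumptions.

```python
import ipaddress
from urllib.parse import urlparse

def is_crawlable_url(url: str) -> bool:
    """Illustrative pre-flight check mirroring the documented SSRF rules:
    HTTP/HTTPS only, no localhost or internal domains, no private,
    loopback, or link-local IPs (which includes the common cloud
    metadata endpoint 169.254.169.254)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname
    # Illustrative internal-name checks; the real service's list may differ
    if host == "localhost" or host.endswith((".local", ".internal")):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not a literal IP; the real check would resolve the hostname first
        return True
    # is_global rejects 10.x, 172.16-31.x, 192.168.x, loopback, link-local
    return ip.is_global

print(is_crawlable_url("https://docs.example.com"))   # True
print(is_crawlable_url("http://192.168.1.10/admin"))  # False (private IP)
print(is_crawlable_url("http://169.254.169.254/"))    # False (metadata endpoint)
```

Running a check like this before calling `connections.create` can surface obviously invalid start URLs early, but the server-side validation remains authoritative.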