Lightfeed Extractor: LLM-Powered Web Scraping Library

Lightfeed Extractor is a TypeScript library that uses LLMs and Playwright to extract structured data from web pages. It includes tools for JSON recovery, stealth browser automation, and HTML-to-markdown conversion, the last of which reduces token usage. The library supports various LLM providers and is built to handle complex extraction tasks in local or serverless environments.

Key Points

Integrates Playwright with LLMs to transform unstructured web content into structured data based on Zod schemas.
Features built-in JSON recovery and URL validation to ensure high data integrity and resilience against malformed model outputs.
Supports multiple browser environments: local, serverless (AWS Lambda), and remote browsers reached over a WebSocket connection.
Provides advanced HTML-to-markdown conversion and URL cleaning to optimize token usage and improve data quality.
Works alongside a broader ecosystem including an AI browser agent for natural language navigation and a hosted platform for retail intelligence.
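
To make the URL validation and cleaning point concrete, here is a minimal sketch of what such a step could look like. It is not the library's API: `cleanUrl` and the `TRACKING_PARAMS` list are hypothetical names chosen for illustration, and the set of parameters treated as tracking noise is an assumption. The sketch rejects strings that are not absolute http(s) URLs and strips common tracking parameters, which shortens the URLs an LLM has to read and reproduce.

```typescript
// Hypothetical helper for illustration; not part of Lightfeed Extractor's API.
// Assumed list of query parameters that carry tracking data only.
const TRACKING_PARAMS = new Set(["fbclid", "gclid", "mc_eid"]);

function cleanUrl(raw: string): string | null {
  let url: URL;
  try {
    url = new URL(raw); // throws on relative or unparseable input
  } catch {
    return null;
  }
  // Only keep web links; rejects schemes such as javascript: or mailto:.
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return null;
  }
  // Collect keys first so the parameter list is not mutated while iterating.
  const keys: string[] = [];
  url.searchParams.forEach((_value, key) => keys.push(key));
  for (const key of keys) {
    if (key.startsWith("utm_") || TRACKING_PARAMS.has(key)) {
      url.searchParams.delete(key);
    }
  }
  return url.toString();
}

console.log(cleanUrl("https://example.com/p?id=1&utm_source=x&fbclid=abc"));
console.log(cleanUrl("not a url"));
```

A URL that fails validation is returned as `null` rather than passed through, so a caller can drop or flag the field instead of storing a link the model may have hallucinated.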

Sentiment

The community was predominantly skeptical and critical. The strongest pushback centered on the ethical implications of the library's anti-bot features and lack of robots.txt compliance, with multiple independent commenters raising this concern. Technical credibility was also questioned, with some dismissing the malformed JSON problem as outdated. The author was responsive and committed to addressing robots.txt compliance, which somewhat tempered the criticism.

In Agreement

XML closing tags give LLMs structural anchors for tracking position during generation; JSON has no equivalent, which makes it more error-prone for complex schemas
Partial data recovery is valuable in production pipelines where one malformed object in a large array would otherwise cause total extraction failure
The HTML-to-markdown conversion approach is a practical way to reduce token usage while preserving meaningful content
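
The partial-recovery argument above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `recoverArray`, `isProduct`, and the `Product` shape are hypothetical, and a hand-written type guard stands in for the Zod schema the library is described as using, so that the example has no dependencies. Each array element is validated on its own, so one bad object is dropped instead of failing the whole extraction.

```typescript
// Hypothetical example shape; a real pipeline would derive this from a schema.
interface Product {
  name: string;
  price: number;
}

// Hand-written guard standing in for schema validation (assumption).
function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.name === "string" &&
    typeof v.price === "number" &&
    Number.isFinite(v.price)
  );
}

// Validate elements independently and keep the ones that pass.
function recoverArray<T>(
  raw: string,
  guard: (value: unknown) => value is T
): { items: T[]; dropped: number } {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    // Syntactically broken JSON is out of scope for this sketch.
    return { items: [], dropped: 0 };
  }
  if (!Array.isArray(parsed)) return { items: [], dropped: 0 };

  const items: T[] = [];
  let dropped = 0;
  for (const element of parsed) {
    if (guard(element)) {
      items.push(element);
    } else {
      dropped++;
    }
  }
  return { items, dropped };
}

const modelOutput =
  '[{"name":"A","price":1},{"name":"B","price":"oops"},{"name":"C","price":3}]';
const result = recoverArray(modelOutput, isProduct);
console.log(result.items.length, result.dropped);
```

Returning the `dropped` count alongside the surviving items lets a caller decide whether the loss is acceptable or whether the page should be re-extracted. Note that this sketch only handles well-formed JSON with invalid elements; repairing truncated or syntactically broken output is a separate problem.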

Opposed

The library explicitly avoids bot detection while not respecting robots.txt, which constitutes bypassing access restrictions regardless of how it's framed
Modern LLMs with structured output modes no longer produce malformed JSON, undermining the library's core value proposition
The project's documentation and author replies appear AI-generated, raising concerns about authenticity and quality
The distinction between 'preventing browser detection' and 'bypassing access restrictions' is purely semantic and makes no practical difference