Maximizing Vocabulary Coverage with Books: A Submodular, NP-Hard Selection Problem
Sentiment — Article: Neutral · Community: Mixed
Choosing one book with maximal weighted vocabulary coverage is straightforward and computable in near-linear time. Selecting the best k books, however, is the NP-hard maximum coverage problem, making exact solutions impractical for larger k. Because the objective is monotone submodular, greedy maximization and related heuristics offer efficient approximations with provable guarantees.
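The single-book score described here can be sketched in a few lines: sum the language-wide frequency weights over the unique words a book contains. The frequency table below is a toy placeholder, not real corpus data.

```python
def coverage_score(book_words, freq):
    """Weighted coverage: sum of language-wide frequency weights
    over the *unique* words appearing in the book."""
    return sum(freq.get(w, 0.0) for w in set(book_words))

# Hypothetical frequency weights for illustration only.
freq = {"the": 0.07, "cat": 0.001, "sat": 0.0008, "quantum": 0.00001}
book = ["the", "cat", "sat", "the", "cat"]  # duplicates count once
score = coverage_score(book, freq)  # 0.07 + 0.001 + 0.0008
```

Ranking every book by this score and taking the maximum is the near-linear single-book case; the difficulty only appears when books overlap and k > 1.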
Key Points
- Single-book selection is easy: compute a weighted coverage score using language-wide word frequencies; this can be done in near-linear time after optional stop-word filtering.
- Exact selection of the best two books is already quadratic; the best k-book selection is the NP-hard maximum (weighted) coverage problem.
- Because the objective is submodular and monotone, greedy algorithms provide strong approximation guarantees and are practical.
- Libraries like submodlib implement fast submodular maximization methods suitable for this task.
- Heuristics such as block additions, look-ahead, and pruning overlapping books can improve practical results without changing worst-case approximation bounds.
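The greedy approach from the key points above can be sketched as follows: at each step, pick the book whose not-yet-covered words add the most frequency weight. Since the objective is monotone submodular, this plain greedy loop achieves the classic (1 − 1/e) approximation. The books and weights below are toy assumptions for illustration.

```python
def greedy_max_coverage(books, freq, k):
    """Greedy weighted maximum coverage.

    books: list of sets of unique words per book.
    freq:  word -> language-wide frequency weight.
    Repeatedly selects the book with the largest marginal gain,
    i.e. the total weight of its still-uncovered words.
    """
    covered, chosen = set(), []
    for _ in range(k):
        best, best_gain = None, 0.0
        for i, words in enumerate(books):
            if i in chosen:
                continue
            gain = sum(freq.get(w, 0.0) for w in words - covered)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no remaining book adds anything
            break
        chosen.append(best)
        covered |= books[best]
    return chosen, covered

# Toy instance: three overlapping "books".
freq = {"a": 3.0, "b": 2.0, "c": 2.0, "d": 1.0, "e": 1.0}
books = [{"a", "b"}, {"a", "c", "d"}, {"c", "d", "e"}]
picked, covered = greedy_max_coverage(books, freq, 2)
# picks book 1 first (gain 6.0), then book 0 (marginal gain 2.0)
```

A lazy-greedy variant (caching marginal gains in a priority queue, as libraries like submodlib do) gives the same selections much faster on large collections.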
Sentiment
The community appreciates the algorithmic content but is skeptical about the language learning premise, with most commenters arguing that comprehensible input and graded readers are more effective than vocabulary coverage optimization.
In Agreement
- The submodular optimization problem is genuinely interesting from an algorithmic perspective and the greedy approximation approach is elegant
- Someone built a working prototype of vocabulary coverage optimization using n-grams trained on subtitles, validating the core concept
- The idea connects to historical precedents like the Chinese Thousand Character Text, a 6th-century poem using exactly 1000 unique characters as a language primer
Opposed
- The article's premise is impractical procrastination: worrying about NP-hard optimization instead of actually learning the language
- Comprehensible input theory suggests you want mostly familiar context with few new words, not maximum vocabulary exposure, which is the opposite of what the article optimizes for
- Graded readers and sentence-level SRS flashcards are more effective than book-level selection for language acquisition
- Language learning is fundamentally emotional and motivational, not an algorithmic optimization problem
- The approach ignores what the learner already knows, which is critical for effective vocabulary building