CauseNet: An Open 11M-Relation Causality Graph from the Web
CauseNet is a large, open-domain graph of over 11 million claimed causal relations extracted from web sources, each carrying detailed provenance, at an estimated 83% extraction precision. It offers Full, Precision, and Sample editions, tools for Neo4j loading, and datasets for training a concept spotter. The resource demonstrates benefits for causal QA and targets future applications in reasoning and argumentation.
Key Points
- CauseNet aggregates over 11 million claimed causal relations into an open-domain causality graph with an estimated 83% extraction precision.
- Data are mined from multiple web sources (ClueWeb12, Wikipedia sentences, lists, and infoboxes) and each relation includes detailed provenance metadata.
- Three dataset editions balance coverage and quality: Full, Precision subset, and a small Sample without provenance; example code supports Neo4j loading.
- A sequence-tagging concept spotter identifies multi-word causal concepts; associated training/evaluation datasets are publicly available (80/10/10 splits).
- The resource supports causal QA and broader reasoning tasks, with initial QA gains demonstrated and permissive licensing encouraging community use and extension.
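The Neo4j loading mentioned above could be approached roughly like this: a minimal Python sketch that converts a CauseNet-style JSON relation record into a Cypher MERGE statement. The field names (`causal_relation`, `cause`, `effect`, `concept`) are an assumption about the published JSONL layout; verify against the actual dump before loading.

```python
import json

def relation_to_cypher(record: dict) -> str:
    """Turn one CauseNet-style JSON record into a Cypher MERGE statement.

    The field names used here (causal_relation / cause / effect / concept)
    are assumptions about the dataset's JSONL layout, not a confirmed schema.
    """
    rel = record["causal_relation"]
    cause = rel["cause"]["concept"].replace("'", "\\'")
    effect = rel["effect"]["concept"].replace("'", "\\'")
    return (
        f"MERGE (c:Concept {{name: '{cause}'}}) "
        f"MERGE (e:Concept {{name: '{effect}'}}) "
        f"MERGE (c)-[:CAUSES]->(e);"
    )

# A toy record in the assumed layout.
record = json.loads(
    '{"causal_relation": {"cause": {"concept": "smoking"},'
    ' "effect": {"concept": "cancer"}}}'
)
print(relation_to_cypher(record))
```

MERGE (rather than CREATE) keeps concept nodes unique across the millions of relations, which matters when the same concept appears as both a cause and an effect.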
Sentiment
The overall sentiment is skeptical but intellectually engaged. Most commenters see fundamental problems with the approach — either philosophical (correlation is not causation, ontologies are inherently brittle) or practical (lack of nuance, poor extraction quality, LLMs already do this better). However, the discussion is more thoughtful than hostile, with genuine engagement on deep questions about causality, knowledge representation, and the limits of symbolic AI. A notable minority defends the research value of the dataset as a starting point for further work.
In Agreement
- The concept of cataloging causal relationships has research value, with the dataset generating roughly 110 citations since 2020 and enabling hypotheses about causal structures in domains like medical diagnostics
- Provenance tracking (retaining source sentences and extraction paths) adds meaningful value beyond simple cause-effect pairs, enabling fact-checking and quality evaluation
- Even imperfect causal triples are useful starting points that researchers can prune, extend, and build upon — dismissing ontologies entirely ignores that every if-statement and ER diagram is a form of knowledge representation
- The semantic web approach of linking concepts with URIs and labeled arrows has untapped potential for open data linking and reasoning, and CauseNet contributes to keeping this space active
- The dataset could serve as supplementary training data for LLMs or as a seed for LLM-driven ontology refinement
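The pruning that defenders describe is straightforward once provenance is retained. A hedged sketch, assuming each relation carries a `sources` list of provenance entries (an assumption about the data layout, not a confirmed field name):

```python
def prune_by_support(relations, min_sources=2):
    """Keep only relations backed by at least `min_sources` provenance entries.

    Assumes each relation dict has a `sources` list of provenance records;
    this mirrors the provenance-tracking idea rather than an exact schema.
    """
    return [r for r in relations if len(r.get("sources", [])) >= min_sources]

# Toy relations: one well-attested, one single-source claim.
relations = [
    {"cause": "smoking", "effect": "cancer", "sources": [{"s": 1}, {"s": 2}]},
    {"cause": "vaccines", "effect": "autism", "sources": [{"s": 1}]},
]
print(prune_by_support(relations))  # drops the single-source claim
```

Thresholding on source count is crude, but it illustrates how provenance turns a flat triple list into something researchers can filter and evaluate.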
Opposed
- Simple cause-effect pairs without conditions, probability, or context are practically useless — chaining them produces absurd sequences that reveal the brittleness of the approach
- The dataset captures claimed causation including known misinformation (vaccines → autism) because regex-based extraction misses negation and context, undermining the 83% precision claim
- LLMs already encode causal knowledge in richer, higher-dimensional representations through embeddings, making flat causal graphs redundant — expert systems using this symbolic approach were mostly failures
- Ontology-based approaches have repeatedly failed over decades due to inherent brittleness; CauseNet repeats the Cyc project's fundamental mistake of trying to manually encode world knowledge
- Using Wikipedia and web crawls as primary data sources limits credibility, and the extraction is too crude to distinguish genuine causal relationships from definitions, tautologies, and correlations
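The "absurd chains" objection is easy to reproduce: naive transitive chaining over bare cause→effect edges compounds error at every hop. A minimal BFS sketch over invented toy edges (the edges and concept names are illustrative, not drawn from CauseNet):

```python
from collections import deque

def causal_chains(edges, start, max_len=4):
    """Enumerate cause->effect paths of up to `max_len` hops via BFS.

    This naive chaining is precisely what critics point at: without
    conditions or probabilities, each hop compounds error.
    """
    graph = {}
    for cause, effect in edges:
        graph.setdefault(cause, []).append(effect)
    chains, queue = [], deque([[start]])
    while queue:
        path = queue.popleft()
        if len(path) > 1:
            chains.append(path)
        if len(path) <= max_len:
            for nxt in graph.get(path[-1], []):
                if nxt not in path:  # skip cycles
                    queue.append(path + [nxt])
    return chains

# Toy edges, invented for illustration.
edges = [("rain", "wet roads"), ("wet roads", "accidents"),
         ("accidents", "traffic"), ("traffic", "stress")]
for chain in causal_chains(edges, "rain"):
    print(" -> ".join(chain))
```

Each individual edge is plausible, yet the end-to-end chain "rain causes stress" already glosses over every intervening condition, which is the brittleness the skeptics describe.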