The World-Wide Web consists of a huge number of unstructured hypertext documents, but it also contains structured data in the form of HTML tables. Many of these tables contain both relational-style data and a small “schema” of labeled and typed columns, making each such table a small structured database. The WebTables project is an effort to extract and make use of the huge number of these structured tables on the Web. A clean collection of relational-style tables could be useful for improving web search, schema design, and many other applications. This paper describes the first stage of the WebTables project. First, we give an in-depth study of the Web’s HTML table corpus. For example, we extracted 14.1 billion HTML tables from a several-billion-page portion of Google’s general-purpose web crawl, and estimate that 154 million of these tables contain high-quality relational-style data. We also describe the crawl’s distribution of table sizes and data types. Second, we describe a system for performing relation recovery. The Web mixes relational and non-relational tables indiscriminately (often on the same page), so there is no simple way to distinguish the 1.1% of good relations from the remainder, nor to recover column label and type information. Our mix of hand-written detectors and statistical classifiers takes a raw Web crawl as input, and generates a collection of databases that is five orders of magnitude larger than any other collection we are aware of. Relation recovery achieves precision and recall that are comparable to other domain-independent information extraction systems.

1.