
Self-building databases

The world's first autonomously self-growing, self-structuring and self-checking database
April 16th, 2024


Imagine a database that built and maintained itself.

Given a goal, and access to any body of existing information (e.g. the public internet, or private intranet), this database would find structure amidst the chaos; researching, collating, and extracting specific information you care about; while storing it in an easy-to-access way.

As this database grew, it would manage its own schemas, ensuring information is represented semantically, in whatever manner makes most sense to you, given your reasons for collecting the data in the first place. At the same time, the information would be stored in such a way that it is cross-compatible with others' representations of the same entities, even if their types (and yours) are private to begin with.

Over time, this database would check, validate and enrich your information, ensuring its freshness and accuracy, rather than letting it become stale (as happens to data in every other database, ever). Raw energy, via compute, is converted into beautiful negentropy.

When it comes to sharing your information with other people, businesses, or service-providers, the database lets you do so with just one click. And when you want to remove their access, it's equally simple.

Information from existing external applications and databases can be trivially two-way synced with the database, and the synced data represented as semantic entities for easy management and use.

Example use cases

Inferring novel statistics

For many kinds of commercially-valuable information, providers of aggregated data and statistics already exist. For example, in the case of the financial markets, LSEG Data & Analytics or S&P's CapIQ provide many forms of standardized data.

But for more idiosyncratic data, information may not be available "off-the-shelf" or in a pre-prepared form. In those cases, organizations often spend significant time compiling this data, leading to delays in decision-making, and not inconsiderable expense (even when research is offshored, as is commonly the case).

Calculating metrics and inferring novel statistics may involve:

  • Collating hard-to-gather information from many different websites or sources
  • Talking to or requesting input from a wide array of people, including both the subjects of the statistics and expert observers
  • Synthesizing gathered information into a clean, consistently structured form
  • Performing calculations or analysis using the gathered data

Producing these statistics, which are not readily available to others, can be invaluable: enabling individuals to make better informed decisions, providing traders with "alpha" to better price opportunities, and allowing businesses to better assess trends and allocate resources.

Enriching datasets

Many organizations have "analysts" or "associates" whose roles involve large-scale data collection, research and analysis of individuals, companies, opportunities, and ideas.

For example, a venture capitalist may want to conduct research on all of the authors of a particular academic paper. Within frontier AI research groups, for example, it is not uncommon for certain papers to have 10-20 (or more!) authors. Being able to take only the name of or URL to the paper, and to automatically gather the authors' names and organizational affiliations, LinkedIn profile, Google Scholar profile, Twitter profile, GitHub profile, current title, previous employers, and so on would save hours of manually scouring the web, freeing analysts up to conduct more meaningful due diligence.

Monitoring and tracking

Automated monitoring and updating of information would be one of this self-building database's core competencies.

When it comes to product management work, this can be used to aid competitive intelligence:

  • Tracking the prices and feature-sets of competitors' products
  • Monitoring the news for press mentions
  • Keeping an eye on customer reviews to identify pain points in your own and others' offerings
  • Checking who's attending, speaking at, and sponsoring relevant industry conferences or events
  • Watching who follows and likes your competition's content in order to improve your own customer targeting
  • Extracting important relevant information automatically from never-ending product changelogs, company blogs, press releases, social media posts and other online publications

Identifying opportunities

No matter what side of an opportunity you're on, the legwork of identifying opportunities and qualifying or filtering them can be time-consuming. Whether you're...

  • Prospecting for investment opportunities, or seeking out potential investors
  • Headhunting potential new recruits, or researching potential open roles
  • Identifying and qualifying sales targets, or finding the best solution provider for your needs
  • Selling commercial real estate, or looking for a unique space

...you're typically going to spend a large amount of time Googling, clicking through onto a variety of different websites and results, grappling with inadequate search filters, and manually doing a lot of the information parsing yourself.

Enabling complex systems modeling

Complex systems can be incredibly sensitive to their initial conditions. Even small differences between a variable used in a model, and the actual real-world value of a thing, can cause huge differences in simulated outcomes.
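To make that sensitivity concrete, here is a minimal sketch using the logistic map, a standard textbook example of a chaotic system. It is our own illustrative choice rather than one of the models discussed in this post, and the starting values are arbitrary. Two runs begin one part in a billion apart, and the gap between them is tracked over sixty steps.

```python
def logistic_trajectory(x0, r=4.0, steps=60):
    """Iterate the logistic map x -> r * x * (1 - x), a toy chaotic system."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Hypothetical values: what the model assumed versus the "real-world" value,
# which differs by one part in a billion.
measured = logistic_trajectory(0.2)
actual = logistic_trajectory(0.2 + 1e-9)

# Absolute difference between the two runs at every step.
gaps = [abs(a - b) for a, b in zip(measured, actual)]
print(f"initial gap: {gaps[0]:.1e}, largest gap within 60 steps: {max(gaps):.2f}")
```

The error roughly doubles at each step, so a discrepancy that starts far below any practical measurement precision comes to dominate the simulated outcome within a few dozen iterations.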

In many domains, ensuring models are as accurate as possible requires the observation and collection of timely, granular, real-world information, or "signals". Doyne Farmer, of the Santa Fe Institute and Oxford's Institute for New Economic Thinking, refers to the need for both these representative models and accurate, real-time data as a need for "collective awareness". Using publicly available unstructured data, we can in theory create live, composite metrics (e.g. gauges of price volatility and inflation, with respect to different types of products)... but doing so today is hard. Doing this type of thing with a self-building database becomes immensely easier.

This use case is particularly close to our hearts: HASH was founded in 2019 as a multi-agent systems research lab, building agent-based simulation modeling tools.

Technical requirements

Given the database that we "imagine", there are quite a few technical components that need to exist to make it real. Some of these are easy to find, while others do not obviously already exist.

A type system

Many people have tried to build graphs of "linked open data" before, but they haven't worked, or at least haven't gained the dominance required to be a truly open graph rather than a series of bespoke silos, because different people conceptualize the same things differently. Not all folks agree on defining things the same way, often for very good reasons (in spite of the best efforts of projects that try to standardize definitions), and even when they do, individuals often care about different aspects of those things. For example, one person might care about how scenic a given walking route is, a second about how hilly it is, and a third about its directness and efficiency.

Type systems let their users define things, and the aspects of those things that they care about.

A multi-tenant type system lets different users define entities however they like, while retaining an ability to refer to the same underlying semantic objects (our "entities" or "things") by declaring one type as being semantically the 'same as' another, even when descriptions and definitions of them differ. Great prior work in this space like Project Cambria provides a starting point for thinking about how this can be achieved.
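As a rough illustration of the 'same as' idea, the sketch below keeps each tenant's type definitions separate while grouping types that have been declared equivalent. It is a toy under stated assumptions, not HASH's actual type system: `TypeRegistry`, its methods, the tenant names, and the property names are all hypothetical, and equivalence is tracked with a simple union-find structure.

```python
class TypeRegistry:
    """Toy multi-tenant registry: each tenant owns its own types, and
    'same as' declarations group types describing the same thing."""

    def __init__(self):
        self._properties = {}  # type id -> set of property names
        self._parent = {}      # union-find parent pointers

    def define(self, tenant, name, properties):
        type_id = f"{tenant}/{name}"
        self._properties[type_id] = set(properties)
        self._parent[type_id] = type_id
        return type_id

    def _find(self, type_id):
        # Walk up to the group's representative, shortening the path as we go.
        while self._parent[type_id] != type_id:
            self._parent[type_id] = self._parent[self._parent[type_id]]
            type_id = self._parent[type_id]
        return type_id

    def declare_same_as(self, a, b):
        self._parent[self._find(a)] = self._find(b)

    def is_same_as(self, a, b):
        return self._find(a) == self._find(b)

    def shared_properties(self, a, b):
        return self._properties[a] & self._properties[b]


registry = TypeRegistry()
# Three tenants describe walking routes, each caring about a different aspect.
alice = registry.define("alice", "WalkingRoute", ["name", "scenery_rating"])
bob = registry.define("bob", "Trail", ["name", "elevation_gain_m"])
carol = registry.define("carol", "Path", ["name", "distance_km"])
dave = registry.define("dave", "Invoice", ["name", "amount"])

registry.declare_same_as(alice, bob)
registry.declare_same_as(bob, carol)

print(registry.is_same_as(alice, carol))       # equivalence is transitive
print(registry.shared_properties(alice, bob))  # what both definitions cover
```

Each tenant's definition stays intact, so nobody is forced onto a single shared schema; the equivalence links are what allow data to be exchanged between them.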

A graph datastore

To support all the capabilities that a self-building, globally connected database needs in order to be truly useful, we need a new kind of graph database with native support for:

  • Multi-tenancy, allowing for the creation of a globally connected knowledge graph, integrated alongside every individual and organization's private data, with the ability to securely and selectively grant access to subgraphs of information when desired
  • Temporal versioning, enabling an understanding of how the knowledge in the graph has evolved over time
  • Provenance and confidence metadata, enabling contextual processing of information based on probabilities (taking into account its source, and your own or a consensus view of its trustworthiness, as well as technical factors that might impact individual data's reliability)
  • Strongly typed entities, which support consistency and predictability in the structure of entity data, as well as cross-references ("crosswalking") between different type definitions which refer to the same underlying thing in the world, via the type system we describe above
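To show how temporal versioning and provenance and confidence metadata might sit together on a single entity, here is a minimal in-memory sketch. It is not HASH's data model: `PropertyVersion`, `Entity`, `as_of`, and the sample company figures and source labels are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class PropertyVersion:
    value: object
    valid_from: datetime
    source: str        # provenance: where the value came from
    confidence: float  # 0.0 to 1.0: how far the value is trusted


@dataclass
class Entity:
    entity_type: str
    history: dict = field(default_factory=dict)  # property name -> versions

    def record(self, name, version):
        # Keep every version rather than overwriting, ordered by time.
        versions = self.history.setdefault(name, [])
        versions.append(version)
        versions.sort(key=lambda v: v.valid_from)

    def as_of(self, name, moment, min_confidence=0.0):
        """Latest sufficiently trusted version already valid at `moment`."""
        candidates = [
            v for v in self.history.get(name, [])
            if v.valid_from <= moment and v.confidence >= min_confidence
        ]
        return candidates[-1] if candidates else None


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical data: two headcount claims from differently trusted sources.
company = Entity("Company")
company.record("headcount", PropertyVersion(120, utc(2023, 1, 1), "annual-report", 0.9))
company.record("headcount", PropertyVersion(150, utc(2024, 1, 1), "news-article", 0.5))

latest = company.as_of("headcount", utc(2024, 6, 1))
trusted = company.as_of("headcount", utc(2024, 6, 1), min_confidence=0.8)
print(latest.value, latest.source)
print(trusted.value, trusted.source)
```

Because nothing is overwritten, the same entity can answer both "what do we believe now?" and "what did we believe then, and on whose authority?", with each caller free to apply its own confidence threshold.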

Agentic AI

To efficiently derive structure from the unstructured web we need intelligent agents. These agents need to be capable of running for an extremely long time, processing vast quantities of data. They need to be able to identify real patterns amidst noise, pick out the most important needles within unruly haystacks, solve problems, perform calculations, and conduct analysis on the fly... introspecting and checking their own work, seeking confirmatory and contradictory sources, and formulating confidence assessments around the "synthesis" they produce.

A task executor

Finally, to support complex agentic AI, we need a task executor capable of handling extremely long-running jobs. Thankfully, "durable execution" solutions such as Temporal are now widely available, and open-source.
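The core idea behind durable execution can be sketched in a few lines: persist the result of each completed step, so that a job which crashes partway through resumes where it left off instead of starting again. The example below is a deliberately simplified toy and does not use Temporal's API. `run_durably`, the step functions, and the JSON checkpoint file are all hypothetical, and real systems also handle retries, timeouts, and distributed workers.

```python
import json
import os
import tempfile


def run_durably(steps, checkpoint_path):
    """Run named steps in order, persisting each result so that a restart
    skips the steps that already completed."""
    results = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            results = json.load(f)
    for name, step in steps:
        if name in results:
            continue  # finished in an earlier run; do not repeat the work
        results[name] = step(results)
        # Write to a temporary file and rename it into place, so a crash
        # mid-write cannot leave a corrupted checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(results, f)
        os.replace(tmp_path, checkpoint_path)
    return results


calls = []  # records which steps actually executed


def fetch(_results):
    calls.append("fetch")
    return ["page-1", "page-2"]


def flaky_extract(results):
    calls.append("extract")
    if calls.count("extract") == 1:
        raise RuntimeError("simulated crash")
    return len(results["fetch"])


steps = [("fetch", fetch), ("extract", flaky_extract)]
checkpoint = os.path.join(tempfile.mkdtemp(), "job.json")

try:
    run_durably(steps, checkpoint)
except RuntimeError:
    pass  # the worker "crashed" partway through the job

final = run_durably(steps, checkpoint)  # resumes after the completed step
print(final, calls)
```

On the second run the `fetch` step is skipped because its result was already checkpointed, which is what matters when a single step may represent hours of agent work.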


This isn't future technology. In fact, HASH is already being used to address each of the use cases outlined above.

HASH is a new kind of database for integrating information, which grows, checks, and maintains itself, with a graphical user interface (GUI) that makes its contents directly, easily and visually accessible to all of its end-users. HASH provides all of the capabilities we "imagine" in our introduction.

We founded HASH, Inc. as a research lab in 2019. In the years since, we've built the type system, graph, and application (in Rust, for speed, safety, and WASM-compatibility) steadily and quietly, testing it in the real world throughout.

Get started


If you'd like to use HASH yourself, we're now inviting users, and you can sign up at

View on GitHub

You can also check out and star the HASH repository on GitHub, if you'd like to explore the application's source code. HASH is open-core software, with a majority of it fully open-sourced under an Apache 2.0/MIT dual-license, or AGPLv3. You can read more about our open-source philosophy and other non-negotiable commitments on our developer blog.

Get in touch

Whether you've got a use-case for HASH, you'd like to learn more, or you're interested in contributing code to the open-source project, feel free to get in touch. We'd love to hear from you!
