The Crunchbase Graph : data modelling

Crunchbase is one of the most complete databases for the startup ecosystem. Startups, investors, markets, funding rounds : Crunchbase tracks how all of these are connected. In a series of blog posts we are going to analyse the Crunchbase Graph. Let’s start with the first step : modelling the data.

The graph of the startup world

Crunchbase is a public database for the startup ecosystem. It tracks companies, investors, funding rounds, people and events and has more than 500,000 data points. The data is provided by Crunchbase’s 50,000 active contributors. It is reviewed and updated by a small team of moderators. Crunchbase claims it has 2 million users accessing its database each month.

Recently, Crunchbase announced that it would use the Neo4j graph database to handle its data. That choice is part of a larger vision : the Business Graph.

Companies are formed, funded, and tested in the marketplace. Products get designed, built and launched. And entrepreneurs, investors and a big cast of business people drive everything forward. Now imagine if you could not only see a timeline of all that activity, but you could also uncover the connections between everyone and everything. That would be the Business Graph.

Crunchbase has an API and does a monthly export of its data.

From tables to graph : data modelling

First the bad news : the data Crunchbase makes available is in a spreadsheet. That is not the best way to understand what a given company or investor is connected to. Thankfully, it is possible to turn any spreadsheet into a graph. In order to do that, we will use the CSV import functionality of Neo4j.

YO DAWG I HERD YOU LIKE NEO4J

Before coding anything though, it is necessary to start with one question : how do we want to model our data, i.e. where are our nodes and edges? Our objective is to decide how we are going to use the data present in the spreadsheet to build a graph.

In the Crunchbase spreadsheet we can find the following sheets : “Licence”, “Analysis”, “Acquisitions”, “Rounds”, “Companies”, “Additions” and “Investments”. For the purpose of our article we are going to focus on the startups and the investors who fund them. The data we need is in the sheets “Companies”, “Acquisitions” and “Investments”.

Here is what the “Companies” sheet looks like :

| permalink | name | homepage_url | category_list | market | funding_total_usd | status | country_code | state_code | region | city | funding_rounds | founded_at | founded_month | founded_quarter | founded_year | first_funding_at | last_funding_at |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| /organization/waywire | #waywire | http://www.waywire.com | \|Entertainment\|Politics\|Social Media\|News\| | News | 1 750 000 | acquired | USA | NY | New York City | New York | 1 | 01/06/2012 | 2012-06 | 2012-Q2 | 2012 | 30/06/2012 | 30/06/2012 |
| /organization/tv-communications | &TV Communications | http://enjoyandtv.com | \|Games\| | Games | 4 000 000 | operating | USA | CA | Los Angeles | Los Angeles | 2 | | | | | 04/06/2010 | 23/09/2010 |
| /organization/rock-your-paper | ‘Rock’ Your Paper | http://www.rockyourpaper.org | \|Publishing\|Education\| | Publishing | 40 000 | operating | EST | | Tallinn | Tallinn | 1 | 26/10/2012 | 2012-10 | 2012-Q4 | 2012 | 09/08/2012 | 09/08/2012 |

Let’s start with the key nodes in our model. We are interested in the startups, the investors and the funding rounds. We are going to use the columns of the spreadsheet to create them. We will create nodes in Neo4j and give them properties stored in the following columns :

  • the startups : permalink, name, homepage_url, funding_total_usd, funding_rounds, founded_at, founded_month, founded_quarter, founded_year, first_funding_at, last_funding_at (data in the “Companies” sheet) ;
  • the investors : investor_permalink, investor_name, funding_round_permalink (data in the “Investments” sheet) ;
  • the funding rounds : funding_round_code, funded_at, funded_month, funded_quarter, funded_year, raised_amount_usd (data in the “Investments” sheet) ;

Let’s take an example. In the “Companies” sheet above we are going to focus on the 2nd row, the one for Waywire. From that row we want to extract a node that will have the following properties :

| Name of property | Value of property |
| --- | --- |
| permalink | /organization/waywire |
| name | #waywire |
| homepage_url | http://www.waywire.com |
| funding_total_usd | 1 750 000 |
| funding_rounds | 1 |
| founded_at | 01/06/2012 |
| founded_month | 2012-06 |
| founded_quarter | 2012-Q2 |
| founded_year | 2012 |
| first_funding_at | 30/06/2012 |
| last_funding_at | 30/06/2012 |

In the graph we will create, there will be a node that will represent a startup called “Waywire”. It will have a series of properties including one called “name” which will have the value “#waywire”.
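
As a rough sketch of that extraction in Python: the row below is retyped by hand from the “Companies” sheet, and the names `STARTUP_PROPERTIES` and `startup_node` are made up for this illustration. It only shows the selection of columns; it is not the actual import into Neo4j.

```python
# Columns of the "Companies" sheet that become properties of a startup node.
STARTUP_PROPERTIES = [
    "permalink", "name", "homepage_url", "funding_total_usd",
    "funding_rounds", "founded_at", "founded_month", "founded_quarter",
    "founded_year", "first_funding_at", "last_funding_at",
]

# The Waywire row, retyped by hand from the sheet (all values kept as text).
waywire_row = {
    "permalink": "/organization/waywire",
    "name": "#waywire",
    "homepage_url": "http://www.waywire.com",
    "category_list": "|Entertainment|Politics|Social Media|News|",
    "market": "News",
    "funding_total_usd": "1 750 000",
    "status": "acquired",
    "country_code": "USA",
    "state_code": "NY",
    "region": "New York City",
    "city": "New York",
    "funding_rounds": "1",
    "founded_at": "01/06/2012",
    "founded_month": "2012-06",
    "founded_quarter": "2012-Q2",
    "founded_year": "2012",
    "first_funding_at": "30/06/2012",
    "last_funding_at": "30/06/2012",
}


def startup_node(row):
    """Keep only the columns we chose as startup properties."""
    return {key: row[key] for key in STARTUP_PROPERTIES}


node = startup_node(waywire_row)
print(node["name"])      # #waywire
print("market" in node)  # False: market is not stored as a property
```

The remaining columns (market, region, city and so on) are deliberately left out of the property dictionary.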

You may have noticed that I have chosen not to attach to my “startup” node every value that was in the “Companies” sheet. Why? Some values like “market” will be modeled as nodes connected to the “startup” node. For example, “Waywire” will be a node connected by an edge to the node “News”. Later on, this will make it easier to see how different startups are connected via the market they address.
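
To make that choice concrete, here is a small Python sketch under stated assumptions: the three rows come from the “Companies” sheet above, but the tuple representation of nodes and edges and the `HAS_MARKET` relationship name are invented for this example, not taken from Crunchbase or Neo4j.

```python
from collections import defaultdict

# (name, market) pairs copied from the "Companies" sheet shown above.
companies = [
    ("#waywire", "News"),
    ("&TV Communications", "Games"),
    ("‘Rock’ Your Paper", "Publishing"),
]

nodes = set()  # each node is a (label, name) pair
edges = set()  # each edge is a (startup, relationship, market) triple

for name, market in companies:
    startup = ("Startup", name)
    market_node = ("Market", market)  # one shared node per distinct market
    nodes.update([startup, market_node])
    # HAS_MARKET is a hypothetical relationship name chosen for this sketch.
    edges.add((startup, "HAS_MARKET", market_node))

# A question we may ask later: which startups address the same market?
startups_by_market = defaultdict(list)
for startup, _, market_node in edges:
    startups_by_market[market_node[1]].append(startup[1])

print(startups_by_market["News"])  # ['#waywire']
```

Because a market is a node of its own, any other startup with the market “News” would attach to the same node and show up in the same list.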

When modelling a graph, it is important to keep in mind the questions we will be asking later.

Here is a picture of the graph we will extract from the “Companies” sheet :

The Startup data extracted from Crunchbase

The “Investments” sheet allows us to complete our graph by connecting a few new objects. Putting it all together, this is what our complete graph data model will look like :

Complete graph data model for Crunchbase.

From a couple of sheets, we have extracted a graph data model where there are 14 different kinds of entities.

We haven’t written a single line of code yet. What we have done is think about the data we have and how we want to transform it into a graph. That process has allowed us to see a few connections that were within the original data but hard to view in a spreadsheet. The startups are connected to each other by the market they address. Investors can be connected to certain types of funding rounds. Of course, an investor and a startup can be connected via a funding round.
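
Even before the real import, the last of these connections can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the row values are placeholders rather than real Crunchbase data, the `company_permalink` column is assumed to exist in the “Investments” sheet, and the relationship names `INVESTED_IN` and `FUNDED` are hypothetical.

```python
# A made-up row in the shape of the "Investments" sheet. Column names follow
# the property lists in this article, except company_permalink, which is an
# assumption. Every value is a placeholder, not real Crunchbase data.
investment_row = {
    "company_permalink": "/organization/example-startup",
    "investor_permalink": "/organization/example-investor",
    "investor_name": "Example Investor",
    "funding_round_permalink": "/funding-round/example-round",
    "funding_round_code": "A",
    "funded_at": "01/01/2014",
    "raised_amount_usd": "1 000 000",
}


def investment_edges(row):
    """Return the two edges linking investor -> funding round -> startup."""
    investor = ("Investor", row["investor_permalink"])
    funding_round = ("FundingRound", row["funding_round_permalink"])
    startup = ("Startup", row["company_permalink"])
    return [
        (investor, "INVESTED_IN", funding_round),  # hypothetical name
        (funding_round, "FUNDED", startup),        # hypothetical name
    ]


for source, relationship, target in investment_edges(investment_row):
    print(source[1], relationship, target[1])
```

The funding round sits between the investor and the startup, so one investment row yields a path of two edges rather than a single direct link.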

 

In the next article of this series, we will see how to use our data model to transform the Crunchbase Excel spreadsheet into a graph. For this we will be using Neo4j.
