The Crunchbase Graph : importing data into Neo4j

How do you turn data into a graph? As part of our series on Crunchbase, we are going to see how to transform a spreadsheet into a Neo4j graph.

From a spreadsheet to a graph

Last time, we built a model describing how to populate a graph with the data found in the monthly Crunchbase dataset. That model is going to be our map: it tells us what our final dataset will look like.

Complete graph data model for Crunchbase.

Right now though, our data is stored in a spreadsheet.

The first step of our journey will be to export the data from the spreadsheet to CSV files. With OpenOffice, we need to select the “Companies” sheet by clicking on its tab in the bottom left corner. Then we select “File” and “Save As”. Next we need to choose the “Text CSV” file format. Hit save.

Repeat the same steps for the “Investments” and “Acquisitions” sheets.

Now we have three CSV files. I have uploaded them here: Investments, Acquisitions and Companies. We are going to use the built-in CSV import functionality of Neo4j to turn our files into a graph.
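
Before importing, it can be worth checking that an exported file looks the way we expect. The following is a minimal Python sketch, not part of the original workflow: it reads a CSV with a header row and reports the data rows whose key column is empty. The sample data and the column names (permalink, name, category_list) are assumptions made for illustration. As far as I know, an empty field is read as a null value during the import, and a null key is a likely cause of the “Cannot merge node using null property value” error.

```python
import csv
import io

# Hypothetical excerpt of an exported "Companies" CSV. The column names
# (permalink, name, category_list) are assumptions made for illustration.
SAMPLE_COMPANIES_CSV = """permalink,name,category_list
/organization/acme,Acme,|Software|Analytics|
,Nameless Inc,|Hardware|
/organization/globex,Globex,
"""


def check_csv(text, key_column):
    """Return (header, number of data rows, line numbers of rows with an empty key)."""
    reader = csv.DictReader(io.StringIO(text))
    missing = []
    row_count = 0
    for row in reader:
        row_count += 1
        value = row.get(key_column)
        if value is None or not value.strip():
            # reader.line_num is the line of the row just read (the header is line 1)
            missing.append(reader.line_num)
    return reader.fieldnames, row_count, missing


header, row_count, missing = check_csv(SAMPLE_COMPANIES_CSV, "permalink")
print("columns:", header)
print("rows:", row_count, "| lines with an empty key:", missing)
```

To check a real export, the same function can be given the content of the file instead of the sample string; rows flagged here can then be fixed or filtered out before the import.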

Importing the CSV files into Neo4j

Neo4j is a graph database. It includes a query language called Cypher. We can use Cypher to turn the data stored in our CSV files into nice nodes and edges.

I have prepared a script for that:

Here are a few remarks about this script:

  • I’m starting by deleting any nodes and edges that might already be in the database;
  • next, I’m asserting a few constraints to make sure no duplicates are created during the import;
  • then, the actual import can start with the creation of the nodes from the “Companies” CSV (we need nodes to create edges!);
  • the “LOAD CSV WITH HEADERS FROM” command means I can refer to the columns by name;
  • before starting the creation of the edges, I create an index on the companies to speed up their lookups;
  • I’m splitting the creation of edges into different “LOAD” statements: it seems to speed up the process.
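
The order of these steps matters, and it can be mimicked in a few lines of plain Python. This is only an in-memory analogy under stated assumptions, not the Cypher script itself: the sample rows, the column names and the “INVESTED_IN” edge type are hypothetical. A dictionary keyed by a unique column stands in for both the uniqueness constraint and the index, and edges are created only once the nodes exist.

```python
import csv
import io

# Hypothetical sample data: column names and values are assumptions.
COMPANIES_CSV = """permalink,name
/organization/acme,Acme
/organization/globex,Globex
/organization/acme,Acme
"""

INVESTMENTS_CSV = """company_permalink,investor_permalink,raised_amount_usd
/organization/acme,/organization/globex,1000000
/organization/acme,/organization/unknown,500000
"""


def load_nodes(text, key):
    """Create one node per distinct key, so repeated rows create no duplicates."""
    nodes = {}  # starting from an empty dict mirrors clearing the database first
    for row in csv.DictReader(io.StringIO(text)):
        if not row[key]:
            continue  # a node cannot be keyed on an empty value
        nodes.setdefault(row[key], dict(row))
    return nodes


def load_edges(text, nodes, source_key, target_key, edge_type):
    """Create edges between existing nodes; return (edges, skipped rows)."""
    edges, skipped = [], []
    for row in csv.DictReader(io.StringIO(text)):
        source, target = row[source_key], row[target_key]
        # The dict lookup plays the role of the index on companies.
        if source in nodes and target in nodes:
            properties = {k: v for k, v in row.items()
                          if k not in (source_key, target_key)}
            edges.append((source, edge_type, target, properties))
        else:
            skipped.append(row)
    return edges, skipped


# Nodes first, then edges: an edge needs both of its endpoints.
companies = load_nodes(COMPANIES_CSV, "permalink")
investments, skipped = load_edges(
    INVESTMENTS_CSV, companies,
    source_key="investor_permalink",
    target_key="company_permalink",
    edge_type="INVESTED_IN",
)
print(len(companies), "nodes,", len(investments), "edges,", len(skipped), "rows skipped")
```

In this sketch the duplicated “Acme” row produces a single node, and the investment that points to an unknown company is set aside instead of creating a dangling edge.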

To run the script on Windows, we are going to use the Neo4j shell. A (slower) alternative would be to use Neo4j’s browser. In the “/bin” directory of Neo4j, right-click on “Neo4jShell” and choose “Run as Administrator”.

The Neo4j Shell.

Simply copy the text of the script from GitHub and paste it into the shell.

A more natural way to work with (connected) data

Most of our customers start with data stored in spreadsheets or their “enterprise” equivalent: relational databases. They are accustomed to looking at data in tables (or pie charts, histograms, etc.). That kind of approach can work for many domains. If you are working with connected data and asking questions related to “connections”, however, relational technologies fail.

Here are some common problems we have heard:

  • querying the connections in my data is slow and hard (the “join bomb” problem);
  • there is no easy way to explore the connections in my data (who knows whom? what does this equipment depend on?);
  • understanding how things are connected is too difficult;
  • my data model does not capture the connections of my business logic.

Graph technologies are not the solution to all data problems, but if you find yourself saying any of the sentences above, it might be worth checking out what can be done with graphs!

Additional resources

I was not familiar with the CSV load functionality of Neo4j and found the following resources very helpful:

Despite having no prior knowledge, I managed to transform a fairly complex spreadsheet into a graph thanks to the resources above: that means you can do it too! There is also a very supportive Neo4j community: if you’re stuck, you can find answers on the Google Group (thank you Mark!) or on Stack Overflow.

 

Now that we have our data stored in a Neo4j graph database, we can start analyzing it. The next step of our series will be to look into the Crunchbase Graph and identify interesting patterns with Cypher. Stay tuned!

11 Responses to “The Crunchbase Graph : importing data into Neo4j”

  1. karabijavadwp December 2, 2014 at 12:42 am #

    awesome post.

    check out my method of importing csv data into neo4j (via batch insertion) using my jruby gem, cadet:
    https://github.com/karabijavad/contributions-graph/blob/master/contributions-graph.rb
    also, importing json:
    https://github.com/karabijavad/congress-graph/blob/master/congress-graph.rb

    • jean December 4, 2014 at 9:49 am #

      Thanks for the link!

  2. Tyler Mitchell February 13, 2015 at 8:02 am #

    Wow great, thanks for sharing Jean.
    Did you happen to come across any self-referencing nodes as edges examples? I.e. the kind of edge list CSV you can import easily into Gephi and have it autocreate missing nodes? 🙂 Having a hard time bending my head around it importing “source”,”target” syntax files since both are “persons” from the same objects in the graph.

    • jean February 13, 2015 at 6:12 pm #

      Thank you so much Tyler. Care to explain what you mean?

  3. Ken Cherven February 24, 2015 at 2:22 pm #

    This is a great series that really shows the power of Neo4j and Linkurious. Your graph data model makes all the possibilities very clear.

    • jean February 24, 2015 at 3:40 pm #

      Thank you so much 🙂

  4. Kaisar Khatak June 13, 2015 at 5:57 am #

    Great post! Was there any null handling to handle unclean data? The script throws some errors working with the csv data files (as is) -> “QueryExecutionKernelException: Cannot merge node using null property value for”

    • Kaisar Khatak June 14, 2015 at 7:50 am #

      Disregard.

      • VKS June 7, 2017 at 5:17 pm #

        Hi – I am also getting the same error for the null value. Can you help me out with the solution pls? Thanks.

  5. Andrew Yip December 20, 2015 at 10:44 pm #

    Hi, thanks for the great post. I’m curious how to separate the |-delimited category_list into separate (:CATEGORY). Would you kindly elaborate? Thank you very much.

  6. Zachary Leahan June 29, 2016 at 2:51 am #

    This post is absolute gold. Thank you.
