Relational to Graph
Importing Data into Neo4j
June 2015
Michael Hunger
michael@neo4j.org |@mesirii
Agenda
• Review Webinar Series
• Importing Data into Neo4j
• Getting Data from RDBMS
• Concrete Examples
• Demo
• Q&A
Webinar Review
Relational to Graph
Webinar Review – Relational to Graph
• Introduction and Overview
• Introduction of Neo4j, Solving RDBMS Issues, Northwind Demo
• Modeling Concerns
• Modeling in Graphs and RDBMS, Good Modeling Practices, Model First, Incremental Modeling, Model Transformation (Rules)
• Import
• Importing into Neo4j, Getting Data from RDBMS, Concrete Examples
• NEXT: Querying
• SQL to Cypher, Comparison, Example Queries, Hard in SQL -> Easy and Fast
in Cypher
Why are we doing this?
The Graph Advantage
Relational DBs Can’t Handle Relationships Well
• Cannot model or store data and relationships
without complexity
• Performance degrades with number and levels
of relationships, and database size
• Query complexity grows with need for JOINs
• Adding new types of data and relationships
requires schema redesign, increasing time to
market
… making traditional databases inappropriate
when data relationships are valuable in real-time
Slow development
Poor performance
Low scalability
Hard to maintain
Unlocking Value from Your Data Relationships
• Model your data naturally as a graph
of data and relationships
• Drive graph model from domain and
use-cases
• Use relationship information in real time
to transform your business
• Add new relationships on the fly to
adapt to your changing requirements
High Query Performance with a Native Graph DB
• Relationships are first-class citizens
• No need for joins, just follow
pre-materialized relationships of nodes
• Query & Data-locality – navigate out
from your starting points
• Only load what’s needed
• Aggregate and project results as you
go
• Optimized disk and memory model
for graphs
Importing into Neo4j
APIs, Tools, Tricks
Getting Data into Neo4j: CSV
Cypher-Based “LOAD CSV” Capability
• Transactional (ACID) writes
• Initial and incremental loads of up to
10 million nodes and relationships
• From HTTP and Files
• Power of Cypher
• Create and Update Graph Structures
• Data conversion, filtering, aggregation
• Destructuring of Input Data
• Transaction Size Control
• Also via Neo4j-Shell
[Diagram: CSV input; scale label "10 M"]
Getting Data into Neo4j: CSV
Command-Line Bulk Loader neo4j-import
• For initial database population
• Scale across CPUs and disk performance
• Efficient RAM usage
• Split- and compressed file support
• For loads up to 10B+ records
• Up to 1M records per second
[Diagram: CSV input; scale label "100 B"]
Getting Data into Neo4j: APIs
Custom Cypher-Based Loader
• Uses the transactional Cypher HTTP endpoint
• Parameterized, batched, concurrent
Cypher statements
• Any programming/scripting language with a
driver or plain HTTP requests
• Also for JSON and other formats
• Also available as JDBC Driver
[Diagram: any data source feeding several concurrent programs; scale label "10 M"]
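The "parameterized, batched" idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the `City` statement, the sample rows, and the helper names `batches` and `build_payload` are made up for the example, and `{rows}` is the Neo4j 2.x parameter syntax (newer versions use `$rows`). The sketch only builds the JSON request bodies; sending them is left out.

```python
import json
from itertools import islice


def batches(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Hypothetical statement: one parameterized Cypher statement per batch.
# The rows travel as a parameter, so the statement text never changes.
STATEMENT = "UNWIND {rows} AS row MERGE (c:City {name: row.city})"


def build_payload(batch):
    """Build a JSON body in the shape used by the transactional Cypher HTTP endpoint."""
    return json.dumps(
        {"statements": [{"statement": STATEMENT, "parameters": {"rows": batch}}]}
    )


# Illustrative sample rows, not taken from a real dataset.
rows = [{"city": "GENT"}, {"city": "DESTELBERGEN"}, {"city": "SAMPLE CITY"}]
payloads = [build_payload(b) for b in batches(rows, 2)]
```

Each payload would then be POSTed as one transaction (in Neo4j 2.x the commit endpoint is typically `/db/data/transaction/commit`), and several workers could send batches concurrently as long as they do not touch the same nodes.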
Getting Data into Neo4j: APIs
JVM Transactional Loader
• Use Neo4j’s Java-API
• From any JVM language, concurrent
• Fine grained TX Management
• Create Nodes and Relationships directly
• Also possible as Server extension
• Arbitrary data loading
[Diagram: any data source feeding several concurrent programs; scale label "1B"]
Getting Data into Neo4j: API
Bulk Loader API
• Used by neo4j-import tool
• Create Streams of node and relationship
data
• Id-groups, id-handling & generation,
conversions
• Highly concurrent and memory efficient
• High performance CSV Parser, Decorators
[Diagram: CSV input; scale label "100 B"]
Import Performance: Some Numbers
• Cypher Import 10k-10M records
• Import 100K-100M records per
second transactionally
• Bulk import tens of billions of records
in a few hours
Import Performance: Hardware Requirements
• Fast disk: SSD or SSD RAID
• Many Cores
• Medium amount of RAM (8-64G)
• Local Data Files, compress to save space
• High performance concurrent connection
to relational DB
• Linux and OS X work better than Windows
(file-system handling)
• Disable Virus Scanners, Check Disk
Scheduler
Accessing Relational Data
Dump, Connect, Extract
Accessing Relational Data
• Dump to CSV: all relational databases can dump query results and tables to CSV
• Access with a DB driver: use JDBC/ODBC or another driver to pull out selected datasets
• Use built-in or external endpoints: some databases expose HTTP APIs or can be integrated (DataClips)
• Use ETL tools: existing ETL tools can read from relational databases and write to Neo4j, e.g. via JDBC
Importing Your Data
Examples
Import Demo
Cypher-Based “LOAD CSV” Capability
• Use to import address data
Command-Line Bulk Loader neo4j-import
• Chicago Crime Dataset
Relational Import Tool neo4j-rdbms-import
• Proof of Concept
JDBC + API
CSV
LOAD CSV
Workhorse of Graph ETL
Data Quality – Beware of Real-World Data!
• Messy! Don't trust the data
• Byte Order Mark
• Binary Zeros, non-text characters
• Inconsistent line breaks
• Header inconsistent with data
• Special character in non-quoted text
• Unexpected newlines in quoted and unquoted text-fields
• stray quotes
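A few of these problems can be removed mechanically before the import. The following is a minimal sketch, assuming the file is UTF-8 encoded; the function name `clean_csv_bytes` and the sample bytes are made up for illustration. It strips a byte order mark, drops binary zeros, and normalizes line breaks.

```python
import codecs


def clean_csv_bytes(raw: bytes) -> str:
    """Return CSV text without a UTF-8 BOM, NUL bytes, or mixed line endings."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    raw = raw.replace(b"\x00", b"")
    # Assumption: the data is UTF-8; undecodable bytes become U+FFFD.
    text = raw.decode("utf-8", errors="replace")
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Illustrative dirty input: BOM, a NUL byte inside a value, mixed line breaks.
dirty = codecs.BOM_UTF8 + b"City,Zip\r\nGent,90\x0000\rDestelbergen,9070\n"
clean = clean_csv_bytes(dirty)
```

Note that this also rewrites line breaks inside quoted fields, and it does nothing about stray quotes or headers that disagree with the data; those need a parser-level check.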
CSV – Workhorse of Data Exchange
• Most Databases, ETL and Office-Tools
can read and write CSV
• Format only loosely specified
• Problems with quotes, newlines, charsets
• Some good checking tools (CSVKit)
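Besides dedicated tools, a quick structural check is easy to script. The sketch below uses Python's standard `csv` module to report rows whose field count differs from the header; the helper name `find_bad_rows` and the sample text are assumptions for the example.

```python
import csv
import io


def find_bad_rows(text, delimiter=","):
    """Return (line_number, field_count) for rows that disagree with the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    bad = []
    for row in reader:
        if len(row) != len(header):
            bad.append((reader.line_num, len(row)))
    return bad


# Illustrative sample: the third line has one field too many.
sample = 'City,Zip\nGent,9000\nDestelbergen,9070,extra\n"Gent",9000\n'
problems = find_bad_rows(sample)
```

Running a check like this on the whole file before LOAD CSV is cheaper than discovering the bad row halfway through a long import.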
Address Dataset
• Exported as large JOIN between
• City
• Zip
• Street
• Number
• Enterprise
• address.csv
EntityNumber  TypeOfAddress  Zipcode  MunicipalityNL  StreetNL             StreetFR             HouseNr
200.065.765   REGO           9070     Destelbergen    Dendermondesteenweg  Dendermondesteenweg  430
200.068.636   REGO           9000     Gent            Stropstraat          Stropstraat          1
LOAD CSV
// create constraints
CREATE CONSTRAINT ON (c:City) ASSERT c.name IS UNIQUE;
CREATE CONSTRAINT ON (z:Zip) ASSERT z.name IS UNIQUE;
// manage tx
USING PERIODIC COMMIT 50000
// load csv row by row
LOAD CSV WITH HEADERS FROM "file:address.csv" AS csv
// transform values
WITH DISTINCT toUpper(csv.City) AS city, toUpper(csv.Zip) AS zip
// create nodes
MERGE (:City {name: city})
MERGE (:Zip {name: zip});
LOAD CSV
// manage tx
USING PERIODIC COMMIT 100000
// load csv row by row
LOAD CSV WITH HEADERS FROM "file:address.csv" AS csv
// transform values
WITH DISTINCT toUpper(csv.City) AS city, toUpper(csv.Zip) AS zip
// find nodes
MATCH (c:City {name: city}), (z:Zip {name: zip})
// create relationships
MERGE (c)-[:HAS_ZIP_CODE]->(z);
LOAD CSV Considerations
• Provide enough memory (heap & page-cache)
• Make sure your data is clean
• Create indexes and constraints upfront
• Use Labels for Matching
• DISTINCT, SKIP, LIMIT to control data volume
• Test with small batch
• Use PERIODIC COMMIT for larger volumes (> 20k)
• Beware of the EAGER Operation
• Will pull in all your CSV data
• Use EXPLAIN to detect it
Simplest LOAD CSV Example | Guide Import CSV | RDBMS ETL Guide
Demo
Mass Data Bulk Importer
neo4j-import --into graph.db
Neo4j Bulk Import Tool
• Memory efficient and scalable Bulk-Inserter
• Proven to work well for billions of records
• Easy to use, no memory configuration needed
CSV
Reference Manual: Import Tool
Chicago Crime Dataset
• City of Chicago, Crime Data since 2001
• Go to Website, download dataset
• Prepare Dataset, Cleanup
• Specify Headers (direct or separate file)
• ID-definition, data-types, labels, rel-types
• Import (30-50s)
• Use!
https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
http://markhneedham.com/blog?s=Chicago+Crime
Chicago Crime Dataset
• crimeTypes.csv
• Types of crimes
• beats.csv
• Police areas
• crimes.csv
• Crime description
• crimesBeats.csv
• In which beat did a crime happen
• crimesPrimaryTypes.csv
• Primary Type assignment
Chicago Crime Dataset
crimes.csv
:ID(Crime),id,:LABEL,date,description
8920441,8920441,Crime,12/07/2012 07:50:00 AM,AUTOMOBILE
primaryTypes.csv
:ID(PrimaryType),crimeType
ARSON,ARSON
crimesPrimaryTypes.csv
:START_ID(Crime),:END_ID(PrimaryType)
5221115,NARCOTICS
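Files in this header format can be generated from a flat export with a short script. The sketch below is one possible way to do it, not the procedure used in the demo: the raw rows are made-up placeholders, and `write_csv` is a hypothetical helper. It produces one node file per entity and one relationship file, using the `:ID(...)`, `:LABEL`, `:START_ID(...)` and `:END_ID(...)` header fields shown above.

```python
import csv
import io

# Made-up placeholder rows: (crime id, date, description, primary type).
raw_rows = [
    ("1", "01/01/2015 09:00:00 AM", "SAMPLE A", "ARSON"),
    ("2", "01/02/2015 10:30:00 AM", "SAMPLE B", "NARCOTICS"),
    ("3", "01/03/2015 11:45:00 AM", "SAMPLE C", "ARSON"),
]


def write_csv(header, rows):
    """Render a header and rows as CSV text (in memory, for illustration)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Node file for crimes: the ID lives in the "Crime" id group.
crimes_csv = write_csv(
    [":ID(Crime)", "id", ":LABEL", "date", "description"],
    [(row[0], row[0], "Crime", row[1], row[2]) for row in raw_rows],
)

# Node file for primary types: one row per distinct type.
primary_types_csv = write_csv(
    [":ID(PrimaryType)", "crimeType"],
    [(t, t) for t in sorted({row[3] for row in raw_rows})],
)

# Relationship file: connects the two id groups.
crimes_primary_types_csv = write_csv(
    [":START_ID(Crime)", ":END_ID(PrimaryType)"],
    [(row[0], row[3]) for row in raw_rows],
)
```

Deduplicating the primary types while generating the node file matters, because the bulk importer expects each ID to appear only once within its id group.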
Chicago Crime Dataset
./neo/bin/neo4j-import \
  --into crimes.db \
  --nodes:CrimeType primaryTypes.csv \
  --nodes beats.csv \
  --nodes crimes_header.csv,crimes.csv \
  --relationships:CRIME_TYPE crimesPrimaryTypes.csv \
  --relationships crimesBeats.csv
Demo
Neo4j-RDBMS-Importer
Proof of Concept
Recap –
Transformation Rules
Normalized ER-Models: Transformation Rules
• Tables become nodes
• Table name as node-label
• Columns turn into properties
• Convert values if needed
• Foreign keys (1:1, 1:n, n:1) become relationships;
the column name becomes the relationship type (or, better, a verb)
• JOIN-Tables represent relationships
• Also other tables without domain identity (w/o PK) and two FKs
• Columns turn into relationship properties
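The rules above can be expressed as a small function over table metadata. This is a simplified sketch, not the logic of any particular tool: the `Table` class, the `to_graph_model` helper, and the Person/Company/knows schema are all hypothetical, and value conversion is left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Table:
    name: str
    columns: list
    primary_key: Optional[str] = None
    # Maps a foreign-key column to the table it references.
    foreign_keys: dict = field(default_factory=dict)


def to_graph_model(tables):
    """Map tables to node labels and foreign keys / join tables to relationships."""
    nodes, rels = {}, []
    for t in tables:
        fks = t.foreign_keys
        if t.primary_key is None and len(fks) == 2:
            # Join table: becomes a relationship; other columns become its properties.
            (_, start), (_, end) = fks.items()
            props = [c for c in t.columns if c not in fks]
            rels.append((start, t.name.upper(), end, props))
            continue
        # Regular table: label from the table name, non-FK columns as properties.
        nodes[t.name] = [c for c in t.columns if c not in fks]
        for column, target in fks.items():
            rels.append((t.name, column.upper(), target, []))
    return nodes, rels


# Hypothetical schema used only to exercise the rules.
tables = [
    Table("Person", ["id", "name", "works_at"], "id", {"works_at": "Company"}),
    Table("Company", ["id", "name"], "id"),
    Table("knows", ["person_a", "person_b", "since"],
          foreign_keys={"person_a": "Person", "person_b": "Person"}),
]
nodes, rels = to_graph_model(tables)
```

Here `Person.works_at` turns into a `WORKS_AT` relationship to `Company`, and the `knows` join table turns into a `KNOWS` relationship carrying `since` as a property.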
Normalized ER-Models: Cleanup Rules
• Remove technical IDs (auto-incrementing PKs)
• Keep domain IDs (e.g. ISBN)
• Add constraints for those
• Add indexes for lookup fields
• Adjust names for Label, REL_TYPE and propertyName
Note: currently no composite constraints and indexes
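The name adjustment in the last rule is mechanical enough to script. The sketch below converts relational names to the three conventions named above (`Label`, `REL_TYPE`, `propertyName`); the helper functions and the sample names are assumptions for illustration.

```python
import re


def _words(name):
    """Split snake_case, kebab-case, or camelCase names into lowercase words."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return [w.lower() for w in re.split(r"[\s_\-]+", spaced) if w]


def label(name):
    """Node label: UpperCamelCase."""
    return "".join(w.capitalize() for w in _words(name))


def rel_type(name):
    """Relationship type: UPPER_SNAKE_CASE."""
    return "_".join(w.upper() for w in _words(name))


def property_name(name):
    """Property name: lowerCamelCase (expects a non-empty name)."""
    first, *rest = _words(name)
    return first + "".join(w.capitalize() for w in rest)


# Hypothetical relational names.
examples = (label("order_details"), rel_type("worksAt"), property_name("FIRST_NAME"))
```

Names that carry no word boundaries at all (for example `orderdetails`) cannot be split automatically and still need a manual override.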
RDBMS Import Tool Demo – Proof of Concept
• JDBC for vendor-independent database connection
• SchemaCrawler to extract DB-Meta-Data
• Use Rules to drive graph model import
• Optional means to override default behavior
• Scales writes with Parallel Batch Importer API
• Reads tables concurrently for nodes & relationships
Demo: MySQL - Employee Demo Database
Source: github.com/jexp/neo4j-rdbms-import
Blog Post
Postgres, MySQL, Oracle
Demo
Architecture & Integration
“Polyglot Persistence”
Three Ways to Migrate Data to Neo4j
• Migrate all data
• Migrate graph data
• Duplicate graph data
[Diagram: applications on top of a relational database and a graph database; data flows labeled "All data", "Graph data", and "Non-graph data"]
Neo4j Fits into Your Enterprise Environment
[Diagram labels: Application; End User; Data Scientist; Graph Database Cluster (Neo4j, Neo4j, Neo4j); Data Storage and Business Rules Execution; Data Mining and Aggregation; Ad Hoc Analysis; Bulk Analytic Infrastructure (Graph Compute Engine, EDW, …); Databases (Relational, NoSQL, Hadoop)]
Next Steps
Community. Training. Support.
There Are Lots of Ways to Easily Learn Neo4j
Resources
Online
• Developer Site
neo4j.com/developer
• RDBMS to Graph
• Guide: ETL from RDBMS
• Guide: CSV Import
• LOAD CSV Webinar
• Reference Manual
• StackOverflow
Offline
• In-Browser Guide "Northwind"
• Import Training Classes
• Office Hours
• Professional Services Workshop
• Free Books:
• Graph Databases 2nd Edition
• Learning Neo4j
Register today at graphconnect.com
Early Bird only $99
Relational to Graph
Data Import
Thank you!
Questions?
neo4j.com | @neo4j


Editor's Notes

  • #7, #19: Presenter Notes - Challenges with current technologies? Database options are not suited to model or store data as a network of relationships. Performance degrades with the number and levels of relationships, making them harder to use for real-time applications. They are not flexible enough to add or change relationships in real time.
  • #8, #9, #16, #17: Presenter Notes - How does one take advantage of data relationships for real-time applications? To take advantage of relationships, data needs to be available as a network of connections (or as a graph). Real-time access to relationship information should be available regardless of the size of the data set or the number and complexity of relationships. The graph should be able to accommodate new relationships or modify existing ones.
  • #50: In the near future, many of your apps will be driven by data relationships and not transactions. You can unlock value from business relationships with Neo4j.