NoSQL and MapReduce

NoSQL databases and MapReduce
J Singh
Early Stage IT

What’s so fun about databases?
  • Traditional database discussions talked about
      • Employee records
      • Bank records
  • Now we talk about
      • Web search
      • Data mining
      • The collective intelligence of tweets
      • Scientific and medical databases

How much data can a database hold?
  • The biggest OLTP databases
      • 2001: 1.1 – 10.3 TB
      • 2003: 9.1 – 29.2 TB
      • 2005: 17.7 – 100.4 TB
      • 2010: ~2.5 PB
  • The trend will continue
  • Very large databases bring new, unique challenges

Historical Context
  • Late 1990s: the web scales out
  • Suddenly, databases were not adequate for holding the data being accumulated
  • Scale out vs. scale up

Brewer’s Conjecture (p1)
Source: Eric Brewer’s July 2000 PODC Keynote
Main points:
  • Classic “Distributed Systems” don’t work
      • They focus on computation, not data
      • Distributing computation is easy, distributing data is hard
  • DBMS research is about ACID (mostly)
      • Atomicity, Consistency, Isolation and Durability
      • But we forfeit “C” and “I” for availability, graceful degradation and performance; this tradeoff is fundamental
  • BASE
      • Basically Available
      • Soft-state
      • Eventual Consistency

Brewer’s Conjecture (p2)
  • BASE
      • Weak consistency (stale data OK)
      • Availability first
      • Best effort
      • Approximate answers OK
      • Aggressive (optimistic)
      • Simpler!
      • Faster
      • Easier evolution
  • ACID
      • Strong consistency
      • Isolation
      • Focus on “commit”
      • Nested transactions
      • Availability?
      • Conservative (pessimistic)
      • Difficult evolution (e.g. schema)
  • “But I think it’s a spectrum” (Eric Brewer)

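The BASE idea that stale data is acceptable can be illustrated with a toy store. This is only a sketch: the class names, the single primary, and the explicit `replicate()` step are assumptions made for illustration and do not describe any particular product.

```python
class Replica:
    """One copy of the data (hypothetical helper for this sketch)."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy store: writes land on a primary, replicas catch up later."""

    def __init__(self, n_replicas=2):
        self.primary = Replica()
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # (key, value) updates not yet copied to replicas

    def write(self, key, value):
        # The write is acknowledged as soon as the primary has it.
        self.primary.data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index=0):
        # Reads are served by a replica and may therefore be stale.
        return self.replicas[replica_index].data.get(key)

    def replicate(self):
        # Apply queued updates in order; afterwards all copies agree.
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("x", 1)
stale = store.read("x")   # replica has not seen the write yet
store.replicate()
fresh = store.read("x")   # replica has now converged
```

Between `write` and `replicate` the read returns an old (here: missing) value, which is the "soft state" window that BASE accepts in exchange for availability.
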
CAP Theorem
  • Since then, Brewer’s conjecture was formally proved: Gilbert & Lynch, 2002
  • Thus Brewer’s conjecture became the CAP theorem…
  • …and contributed to the birth of the NoSQL movement
  • But the theory is not settled
  • Meanwhile, http://nosql-database.org/ lists 122 NoSQL databases

What is NoSQL?
  • Stands for Not Only SQL
  • A class of non-relational data storage systems
  • They usually do not require a fixed table schema, nor do they use the concept of joins
  • All NoSQL offerings relax one or more of the ACID properties

Forces at Work
  • Three major papers were the seeds of the NoSQL movement
      • CAP Theorem (discussed above)
      • BigTable (Google)
      • Dynamo (Amazon)
  • Some types of data could not be modeled well in RDBMS
      • Document storage and indexing
      • Recursive data and graphs
      • Time series data
      • Genomics data

NoSQL Databases
  • Key-Value Stores
      • A storage system that stores values, indexed by a key
      • Examples: Voldemort, Dynomite, Tokyo Cabinet
  • BigTable Clones (aka "ColumnFamily")
      • A tabular model where each row (at least in theory) can have an individual configuration of columns
      • Examples: HBase, Hypertable, Cassandra, Amazon SimpleDB

NoSQL Databases (continued)
  • Document Databases
      • Collections of documents, where each "document" is itself a key-value collection
      • Examples: CouchDB, MongoDB, Riak
  • Graph Databases
      • Nodes and relationships, both of which can hold key-value pairs
      • Examples: AllegroGraph, InfoGrid, Neo4j

Amazon SimpleDB
  • Key-value store
  • Written in Erlang (as is CouchDB)
  • Data is modeled in terms of
      • Domain, a container of entities
      • Item, an entity
      • Attribute and Value, a property of an Item
  • Eventually consistent, except when the ConsistentRead flag is specified
  • Impressive performance numbers, e.g., 0.7 sec to store 1 million records
  • SQL-like SELECT:
        select output_list
        from domain_name
        [where expression]
        [sort_instructions]
        [limit limit]

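The Domain / Item / Attribute / Value model above can be pictured as nested dictionaries. The sketch below is a toy in-memory model, not the SimpleDB API: the `Domain` class, its `put` and `select` methods, and the choice to store each attribute as a set of values are all assumptions made for illustration.

```python
from collections import defaultdict


class Domain:
    """Toy container of items; each item maps attribute names to value sets."""

    def __init__(self, name):
        self.name = name
        # item name -> attribute name -> set of values
        self.items = defaultdict(lambda: defaultdict(set))

    def put(self, item_name, attribute, value):
        self.items[item_name][attribute].add(value)

    def select(self, attribute, value, limit=None):
        # Rough analogue of: select itemName() from domain
        #                    where attribute = value [limit n]
        matches = sorted(
            name
            for name, attrs in self.items.items()
            if value in attrs.get(attribute, ())
        )
        return matches if limit is None else matches[:limit]


books = Domain("books")
books.put("item1", "author", "Brewer")
books.put("item2", "author", "Lynch")
books.put("item3", "author", "Brewer")
by_brewer = books.select("author", "Brewer")
```

Note that there is no schema: a new attribute appears the first time any item uses it.
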
Google Datastore
  • Part of App Engine; also used for internal applications
  • Used for all storage
  • Incorporates a transaction model to ensure high consistency
      • Optimistic locking
      • Transactions can fail
  • CAP implications
      • Datastore isn’t just “eventually consistent”
      • They offer two commercial options (with different prices)
          • Master/Slave: low latency but also lower availability; asynchronous replication
          • High Replication: strong availability at the cost of higher latency

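"Optimistic locking" and "transactions can fail" go together: a writer does not take a lock, it checks at commit time that nobody else changed the data, and retries if someone did. The following is a generic sketch of that pattern under stated assumptions. `VersionedStore`, `ConflictError` and `increment` are hypothetical names and do not reflect the actual Datastore API.

```python
class ConflictError(Exception):
    """Raised when the data changed between read and commit."""


class VersionedStore:
    """Toy store that keeps a version counter next to every value."""

    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def read(self, key):
        # Unknown keys are reported as version 0 with no value.
        return self._rows.get(key, (0, None))

    def commit(self, key, expected_version, new_value):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            # Someone else committed first: the transaction fails.
            raise ConflictError(key)
        self._rows[key] = (current_version + 1, new_value)


def increment(store, key, retries=3):
    """Read-modify-write with retry, the usual response to a failed commit."""
    for _ in range(retries):
        version, value = store.read(key)
        try:
            store.commit(key, version, (value or 0) + 1)
            return True
        except ConflictError:
            continue  # re-read and try again
    return False
```

The design choice is to pay for conflicts only when they happen, which suits workloads where concurrent writes to the same entity are rare.
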
Some production data, circa 2008.

Datastore Application at Google
  • For more info, see the video of Ryan Barrett’s talk at Google I/O

Databases and Key-Value Stores
  • http://browsertoolkit.com/fault-tolerance.png

MapReduce Conceptual Underpinnings
  • Programming model from Lisp and other functional languages
        (map square '(1 2 3 4))   =>  (1 4 9 16)
        (reduce + '(1 4 9 16))    =>  30
  • Easy to distribute
  • Nice failure/retry semantics

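The same two Lisp expressions can be written with Python's built-in `map` and `functools.reduce`. This is just a transliteration of the example on the slide; the helper name `square` is taken from it.

```python
from functools import reduce
from operator import add


def square(x):
    return x * x


# (map square '(1 2 3 4))
squares = list(map(square, [1, 2, 3, 4]))

# (reduce + '(1 4 9 16))
total = reduce(add, squares)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```

Each call to `square` depends only on its own input, which is why the map step is easy to distribute and safe to retry on failure.
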
MapReduce Flow
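
The slide itself is a diagram, so the flow is not spelled out in the text. A commonly used illustration is word count in three stages: map, shuffle (group by key), and reduce. The sketch below runs in a single process and is an assumption about what the diagram conveys. The function names are hypothetical and this is not Hadoop's or Google's API.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["the quick fox", "the lazy dog the end"])
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data across the network. Here everything is sequential to keep the structure visible.
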

Hadoop MapReduce
  • An open source project of the Apache Foundation
  • Other Hadoop-related projects at Apache include:
      • Cassandra™: A scalable multi-master database with no single points of failure.
      • HBase™: A scalable, distributed database that supports structured data storage for large tables.
      • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
      • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • See the Apache Hadoop website for more.

Hadoop Availability
  • Run on your laptop
  • Run on your server
  • Run on Amazon Cloud
  • Introduction at IBM DeveloperWorks
  • Run on Google App Engine
      • It’s not Hadoop, it’s Google’s implementation of MapReduce

MapReduce Statistics @ GOOG
Take-away message:
  • MapReduce is not a “new-fangled technology of the future”
  • It is here, it is proven, use it!

End of an Era?
  • The Relational Model is not necessarily the answer
      • It was excellent for data processing
      • Not a natural fit for
          • Data warehouses
          • Web-oriented search
          • Real-time analytics, and
          • Semi-structured data, i.e., Semantic Web
  • SQL is not the answer
      • Coupling between modern programming languages and SQL is “ugly beyond belief”
      • Programming languages have evolved while SQL has remained static
          • Pascal
          • C/C++
          • Java
          • The little languages: Python, Perl, PHP, Ruby
  • The End of an Architectural Era, Stonebraker et al., Proc. VLDB, 2007
      • A critique of the “one size fits all” assumption in DBMS

Take Aways
  • NoSQL databases are a solution to web-scale problems
  • A lot of data lives outside relational databases
  • With SQLnix.org, we are starting a local resource for NoSQL database knowledge
      • Taking on projects to apply the technology, not just read about it
      • If you want to work on it, please contact us
  • Thanks