If you are looking for some of the most influential research papers that revolutionised the way how we gather, aggregate, analyze and store increasing volumes of data in a short span of 10 years, you are in the right place! These papers were shortlisted, based on recommendations by big data enthusiasts and experts around the globe from various social media channels. In case we’ve missed out any important paper, please let us know.
MapReduce: Simplified Data Processing on Large Clusters
This paper presents MapReduce, a programming model and its implementation for large-scale distributed clusters. The main idea is to have a general execution model for codes that need to process a large amount of data over hundreds of machines.
It presents Google File System, a scalable distributed file system for large distributed data-intensive applications, which provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
Bigtable: A Distributed Storage System for Structured Data
This paper presents the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and the design and implementation of Bigtable.
Dynamo: Amazon’s Highly Available Key-value Store
This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience.
The Chubby lock service for loosely-coupled distributed systems
Chubby is a distributed lock service; it does a lot of the hard parts of building distributed systems and provides its users with a familiar interface (writing files, taking a lock, file permissions). The paper describes it, focusing on the API rather than the implementation details.
Chukwa: A large-scale monitoring system
This paper describes the design and initial implementation of Chukwa, a data collection system for monitoring and analyzing large distributed systems. Chukwa is built on top of Hadoop, an open source distributed filesystem and MapReduce implementation, and inherits Hadoop’s scalability and robustness.
Cassandra – A Decentralized Structured Storage System
Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure.
HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
There are two schools of thought regarding what technology to use for data analysis. Proponents of parallel databases argue that the strong emphasis on performance and efficiency of parallel databases makes them well-suited to perform such analysis. On the other hand, others argue that MapReduce-based systems are better suited due to their superior scalability, fault tolerance, and flexibility to handle unstructured data. This paper explores the feasibility of building a hybrid system.
S4: Distributed Stream Computing Platform.
This paper outlines the S4 architecture in detail, describes various applications, including real-life deployments, to show that the S4 design is surprisingly flexible and lends itself to run in large clusters built with commodity hardware.
Dremel: Interactive Analysis of Web-Scale Datasets
This paper describes the architecture and implementation of Dremel, a scalable, interactive ad-hoc query system for analysis of read-only nested data, and explains how it complements MapReduce-based computing.
Large-scale Incremental Processing Using Distributed Transactions and Notifications
Percolator is a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. This indexing system based on incremental processing replaced Google’s batch-based indexing system.
Pregel: A System for Large-Scale Graph Processing
This paper presents a computational model suitable to solve many practical computing problems that concerns large graphs.
Spanner: Google’s Globally-Distributed Database
It explains about Spanner, Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and sup-port externally-consistent distributed transactions.
Shark: Fast Data Analysis Using Coarse-grained Distributed Memory
Shark is a research data analysis system built on a novel coarse-grained distributed shared-memory abstraction. Shark marries query processing with deep data analysis, providing a unified system for easy data manipulation using SQL and pushing sophisticated analysis closer to data.
The PageRank Citation Ranking: Bringing Order to the Web
This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them.
A Few Useful Things to Know about Machine Learning
This paper summarizes twelve key lessons that machine learning researchers and practitioners have learned, which include pitfalls to avoid, important issues to focus on, and answers to common questions.
This paper describes a method of building a forest of uncorrelated trees using a CART like procedure, combined with randomized node optimization and bagging. In addition, it combines several ingredients, which form the basis of the modern practice of random forests.
A Relational Model of Data for Large Shared Data Banks
Written by EF Codd in 1970, this paper was a breakthrough in Relational Data Base systems. He was the man who first conceived of the relational model for database management.
Map-Reduce for Machine Learning on Multicore
The paper focuses on developing a general and exact technique for parallel programming of a large class of machine learning algorithms for multicore processors. The central idea is to allow a future programmer or user to speed up machine learning applications by “throwing more cores” at the problem rather than search for specialized optimizations.
Megastore: Providing Scalable, Highly Available Storage for Interactive Services
This paper describes Megastore, a storage system developed to blend the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way.
Finding a needle in Haystack: Facebook’s photo storage
This paper describes Haystack, an object storage system optimized for Facebook’s Photos application. Facebook currently stores over 260 billion images, which translates to over 20 petabytes of data.
Spark: Cluster Computing with Working Sets
This paper focuses on applications that reuse a working set of data across multiple parallel operations and proposes a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce.
The Unified Logging Infrastructure for Data Analytics at Twitter
This paper presents Twitter’s production logging infrastructure and its evolution from application-specific logging to a unified “client events” log format, where messages are captured in common, well-formatted, flexible Thrift messages.
F1: A Distributed SQL Database That Scales
F1 is a distributed relational database system built at Google to support the AdWords business. F1 is a hybrid database that combines high availability, the scalability of NoSQL systems like Bigtable, and the consistency and usability of traditional SQL databases.
MLbase: A Distributed Machine-learning System
This paper presents MLbase, a novel system harnessing the power of machine learning for both end-users and ML researchers.
Scalable Progressive Analytics on Big Data in the Cloud
This paper presents a new approach that gives more control to data scientists to carefully choose from a huge variety of sampling strategies in a domain-specific manner.
Big data: The next frontier for innovation, competition, and productivity
This is paper one of the most referenced documents in the world of Big Data. It describes current and potential applications of Big Data.
The Promise and Peril of Big Data
This paper summarizes the insights of the Eighteenth Annual Roundtable on Information Technology, which sought to understand the implications of the emergence of “Big Data” and new techniques of inferential analysis.
TDWI Checklist Report: Big Data Analytics
This paper provides six guidelines on implementing Big Data Analytics. It helps you take the first steps toward achieving a lasting competitive edge with analytics.
This post was orginally published on bigdata-madesimple