If you are looking for some of the most influential research papers that revolutionised the way we gather, aggregate, analyze and store ever-increasing volumes of data in the short span of 10 years, you are in the right place! These papers were shortlisted based on recommendations from big data enthusiasts and experts around the globe, gathered from various social media channels. If we have missed any important paper, please let us know.
This paper presents MapReduce, a programming model and its implementation for large-scale distributed clusters. The main idea is to have a general execution model for code that needs to process a large amount of data across hundreds of machines.
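To make the programming model concrete, here is a minimal single-machine sketch of the classic word-count example, written in Python. It is only an illustration of the map, shuffle and reduce steps; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are our own and are not taken from the paper or from any real framework, and nothing here is distributed.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three steps in sequence on an in-memory list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
```

In a real cluster the map and reduce calls would run on different machines and the shuffle would move data over the network; the sketch keeps everything in one process so the shape of the model is easy to see.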
This paper presents the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware and delivers high aggregate performance to a large number of clients.
This paper presents the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and the design and implementation of Bigtable.
This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience.
Chubby is a distributed lock service; it handles many of the hard parts of building distributed systems and gives its users a familiar interface (writing files, taking a lock, file permissions). The paper describes it, focusing on the API rather than the implementation details.
This paper describes the design and initial implementation of Chukwa, a data collection system for monitoring and analyzing large distributed systems. Chukwa is built on top of Hadoop, an open source distributed filesystem and MapReduce implementation, and inherits Hadoop’s scalability and robustness.
Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure.
There are two schools of thought regarding what technology to use for data analysis. Proponents of parallel databases argue that the strong emphasis on performance and efficiency of parallel databases makes them well-suited to perform such analysis. On the other hand, others argue that MapReduce-based systems are better suited due to their superior scalability, fault tolerance, and flexibility to handle unstructured data. This paper explores the feasibility of building a hybrid system.
This paper outlines the S4 architecture in detail and describes various applications, including real-life deployments, to show that the S4 design is surprisingly flexible and lends itself to running in large clusters built with commodity hardware.
This paper describes the architecture and implementation of Dremel, a scalable, interactive ad-hoc query system for analysis of read-only nested data, and explains how it complements MapReduce-based computing.
Percolator is a system for incrementally processing updates to a large data set. Google deployed it to create its web search index, and this indexing system based on incremental processing replaced Google’s batch-based indexing system.
This paper presents a computational model suited to solving many practical computing problems that concern large graphs.
This paper describes Spanner, Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions.
Shark is a research data analysis system built on a novel coarse-grained distributed shared-memory abstraction. Shark marries query processing with deep data analysis, providing a unified system for easy data manipulation using SQL and pushing sophisticated analysis closer to data.
This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them.
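As a rough illustration of how such a "mechanical" rating can be computed, the sketch below runs a simple iterative PageRank over a tiny link graph. It is a simplified sketch under our own assumptions, not the paper's implementation: the function name `pagerank`, the damping value of 0.85, the fixed iteration count, and the handling of pages without outgoing links are all illustrative choices, and every link target is assumed to appear as a key in the `links` dictionary.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate a rank for each page.

    links maps each page to the list of pages it links to. Every target
    is assumed to also be a key of links (an assumption of this sketch).
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 3))
```

Pages that are linked to by other well-ranked pages end up with higher scores, which is the intuition behind using links as a proxy for human interest and attention.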
This paper summarizes twelve key lessons that machine learning researchers and practitioners have learned, which include pitfalls to avoid, important issues to focus on, and answers to common questions.
This paper describes a method of building a forest of uncorrelated trees using a CART-like procedure, combined with randomized node optimization and bagging. In doing so, it brings together several ingredients that form the basis of the modern practice of random forests.
Written by E. F. Codd in 1970, this paper was a breakthrough in relational database systems. Codd was the first to conceive of the relational model for database management.
The paper focuses on developing a general and exact technique for parallel programming of a large class of machine learning algorithms for multicore processors. The central idea is to allow a future programmer or user to speed up machine learning applications by “throwing more cores” at the problem rather than searching for specialized optimizations.
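One way to picture the “throwing more cores” idea, assuming the learning algorithm can be written as sums over the data points, is to let each worker compute partial sums over its own chunk and then add the partial results together. The sketch below does this for a one-parameter least-squares fit of y ≈ slope · x. The names `partial_sums` and `fit_slope` are hypothetical, and the thread pool is used only to keep the example short and self-contained; in CPython, real multicore speed-ups for CPU-bound work would typically need separate processes.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sums(chunk):
    """Map step: sums of x*y and x*x over one chunk of (x, y) pairs."""
    sum_xy = sum(x * y for x, y in chunk)
    sum_xx = sum(x * x for x, _ in chunk)
    return sum_xy, sum_xx


def fit_slope(points, workers=4):
    """Fit y ≈ slope * x by least squares from per-chunk partial sums.

    Assumes at least one point has a non-zero x value.
    """
    # Split the data into one interleaved chunk per worker.
    chunks = [points[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sums, chunks))
    # Reduce step: add the partial sums, then solve for the slope.
    total_xy = sum(p[0] for p in partials)
    total_xx = sum(p[1] for p in partials)
    return total_xy / total_xx


if __name__ == "__main__":
    data = [(float(x), 3.0 * x) for x in range(1, 101)]
    print(fit_slope(data))
```

Because the final answer depends only on the totals, it is the same no matter how the data is split, which is why adding more workers changes the speed but not the result.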
This paper describes Megastore, a storage system developed to blend the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way.
This paper describes Haystack, an object storage system optimized for Facebook’s Photos application. At the time the paper was written, Facebook stored over 260 billion images, which translated to over 20 petabytes of data.
This paper focuses on applications that reuse a working set of data across multiple parallel operations and proposes a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce.
This paper presents Twitter’s production logging infrastructure and its evolution from application-specific logging to a unified “client events” log format, where messages are captured in common, well-formatted, flexible Thrift messages.
F1 is a distributed relational database system built at Google to support the AdWords business. F1 is a hybrid database that combines high availability, the scalability of NoSQL systems like Bigtable, and the consistency and usability of traditional SQL databases.
This paper presents MLbase, a novel system harnessing the power of machine learning for both end-users and ML researchers.
This paper presents a new approach that gives more control to data scientists to carefully choose from a huge variety of sampling strategies in a domain-specific manner.
This paper is one of the most referenced documents in the world of Big Data. It describes current and potential applications of Big Data.
This paper summarizes the insights of the Eighteenth Annual Roundtable on Information Technology, which sought to understand the implications of the emergence of “Big Data” and new techniques of inferential analysis.
This paper provides six guidelines on implementing Big Data Analytics. It helps you take the first steps toward achieving a lasting competitive edge with analytics.