What is Spark?
Spark is an open source cluster computing system that aims to make data analytics
fast — both fast to run and fast to write.
To run programs faster, Spark provides primitives for in-memory cluster
computing: your job can load data into memory and query it repeatedly, much more quickly than with
disk-based systems like Hadoop MapReduce.
To make programming faster, Spark integrates into the
Scala language, letting you manipulate distributed
datasets like local collections. You can also use Spark interactively to query
big data from the Scala interpreter.
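As a minimal sketch of what "distributed datasets like local collections" can look like, the program below filters a small dataset, keeps the result in memory, and queries it twice. It assumes a recent Apache Spark release on the classpath (the `org.apache.spark` package; older releases used different package names) and a local master. The object name `ErrorCount`, the helper `isError`, and the sample log lines are hypothetical and used only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ErrorCount {
  // Plain Scala predicate; the same function works on a local Seq or on a distributed dataset.
  def isError(line: String): Boolean = line.contains("ERROR")

  def main(args: Array[String]): Unit = {
    // Assumption: run everything in-process with a local master.
    val conf = new SparkConf().setAppName("ErrorCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data; a real job would typically read from a file with sc.textFile.
      val lines = sc.parallelize(Seq(
        "INFO start",
        "ERROR disk timeout",
        "ERROR out of memory",
        "INFO done"
      ))

      // filter looks like the method on a local collection; cache() asks Spark
      // to keep the filtered dataset in memory for later queries.
      val errors = lines.filter(line => isError(line)).cache()

      println(errors.count())                               // 2
      println(errors.filter(_.contains("timeout")).count()) // 1, served from the cached data
    } finally {
      sc.stop()
    }
  }
}
```

The same `filter` and `count` calls can be typed one at a time in an interactive Scala shell, which is the interactive use described above.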
What can it do?
Spark was initially developed for two applications where keeping data in memory
helps: iterative algorithms, which are common in machine learning, and interactive
data mining. In both cases, Spark can outperform Hadoop MapReduce by 30x.
However, you can use Spark for general data processing too.
Check out our example jobs.
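To illustrate why iterative algorithms benefit from in-memory data, here is a hedged sketch of gradient descent for a one-parameter linear model. The dataset is cached once and then scanned on every iteration. It is not one of the project's example jobs: the object name `IterativeFit`, the helper `gradTerm`, the sample points, the step size, and the iteration count are all assumptions chosen for illustration, and the code assumes a recent Apache Spark release with a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeFit {
  // Gradient contribution of one point for the squared error of the model y = w * x
  // (constant factor folded into the step size).
  def gradTerm(w: Double, x: Double, y: Double): Double = (w * x - y) * x

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("IterativeFit").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample points lying on y = 2x, cached because every iteration reads them.
      val points = sc.parallelize(Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0))).cache()
      val n = points.count()

      var w = 0.0
      for (_ <- 1 to 100) {
        // Each pass is a distributed map and reduce over the in-memory dataset.
        val grad = points.map { case (x, y) => gradTerm(w, x, y) }.reduce(_ + _) / n
        w -= 0.1 * grad
      }
      println(w) // moves toward 2.0 for this data
    } finally {
      sc.stop()
    }
  }
}
```

A disk-based system would typically reload the points on each of the 100 passes; here only the first pass pays the cost of building the dataset.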
Spark runs on the Apache Mesos cluster manager, letting
it coexist with Hadoop. It can also read any data source supported by Hadoop.
Who uses it?
Spark was developed in the UC Berkeley AMP Lab.
It's used by several groups of researchers at Berkeley to run large-scale applications such
as spam filtering, natural language processing and road traffic prediction. It's also used to accelerate
data analytics at Conviva, Klout, Quantifind, and other companies.
Spark is open source under a BSD license,
so download it to check it out!