
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs.

Spark provides fast iterative/functional-like capabilities over large data sets, typically by caching data in memory. As opposed to the rest of the libraries mentioned in this documentation, Apache Spark is a computing framework that is not tied to Map/Reduce itself; however, it does integrate with Hadoop, mainly through HDFS. elasticsearch-hadoop allows Elasticsearch to be used in Spark in two ways: through the dedicated support available since version 2.1, or through the Map/Reduce bridge available since version 2.0. Spark 2.0 is supported in elasticsearch-hadoop since version 5.0.

Installation

Just like other libraries, elasticsearch-hadoop needs to be available in Spark's classpath. As Spark has multiple deployment modes, this can mean placing the jar on the classpath of only one node (as is the case with local mode, which will be used throughout the documentation) or on every node, depending on the desired infrastructure.
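
Once the jar is on the classpath (for instance through spark-submit's --jars option), the connector picks up its settings from Spark's own configuration. The following is a minimal sketch, assuming a local Elasticsearch instance on the default REST port; the application name and the es.* values are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("es-spark-example")      // illustrative application name
      .setMaster("local[*]")               // local mode, as used throughout the documentation
      .set("es.nodes", "localhost")        // Elasticsearch node(s) to connect to
      .set("es.port", "9200")              // default REST port
      .set("es.index.auto.create", "true") // create the target index if missing
    val sc = new SparkContext(conf)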

When writing data, the target resource does not have to be fixed: elasticsearch-hadoop also accepts a resource pattern that is resolved against each document being written, making it possible to save each object based on its own resource pattern, the media_type field in the example below.
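
A short sketch of such a write, reusing the sc context from the installation example; the documents, field values and the my-collection-{media_type}/doc pattern are illustrative:

    import org.elasticsearch.spark._ // adds saveToEs to RDDs

    // Three documents whose media_type field decides where each one is saved.
    val game = Map("media_type" -> "game", "title" -> "FF VI")
    val book = Map("media_type" -> "book", "title" -> "Harry Potter")
    val cd   = Map("media_type" -> "music", "title" -> "Surfing With The Alien")

    // The {media_type} token is resolved per document at write time.
    sc.makeRDD(Seq(game, book, cd)).saveToEs("my-collection-{media_type}/doc")

With the values above, the three objects end up under my-collection-game, my-collection-book and my-collection-music respectively.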

Elasticsearch allows each document to have its own metadata. As explained above, through the various mapping options one can customize these parameters so that their values are extracted from the document they belong to. Furthermore, one can even include/exclude what parts of the data are sent back to Elasticsearch. In Spark, elasticsearch-hadoop extends this functionality, allowing metadata to be supplied outside the document itself through the use of pair RDDs.
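
As a brief sketch, again assuming the sc context from above: each entry of the pair RDD is a (metadata, document) tuple, and a plain key is taken as the document id. The airports index and its contents are illustrative:

    import org.elasticsearch.spark._ // adds saveToEsWithMeta to pair RDDs

    val otp = Map("iata" -> "OTP", "name" -> "Otopeni")
    val muc = Map("iata" -> "MUC", "name" -> "Munich")

    // The keys of the pair RDD (1 and 2) become the _id of each document.
    val airportsRDD = sc.makeRDD(Seq((1, otp), (2, muc)))
    airportsRDD.saveToEsWithMeta("airports/2015")

Beyond the id, other metadata (such as version or routing) can be supplied by using a Map keyed by the org.elasticsearch.spark.rdd.Metadata enum in place of the plain key.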
