
How To Install Apache Spark on Manjaro


In this tutorial, we will show you how to install Apache Spark on Manjaro. Apache Spark is a powerful open-source cluster computing framework designed for large-scale data processing. It has gained immense popularity in the big data ecosystem due to its speed, ease of use, and versatility. Whether you’re working with batch processing, real-time streaming, machine learning, or SQL workloads, Spark provides a unified platform to handle it all efficiently.

This article assumes you have at least a basic knowledge of Linux, know how to use the shell, and, most importantly, run your own server or VPS. The installation is quite simple and assumes you are running as the root account; if not, you may need to prepend ‘sudo’ to the commands to obtain root privileges. I will show you the step-by-step installation of Apache Spark on Manjaro Linux.

Prerequisites

  • A server or desktop running Manjaro or another Arch-based distribution.
  • It’s recommended that you use a fresh OS install to prevent any potential issues.
  • SSH access to the server (or just open Terminal if you’re on a desktop).
  • A stable internet connection is crucial for downloading and installing packages. Verify your connection before proceeding.
  • Access to a Manjaro Linux system with a non-root sudo user or root user.

Install Apache Spark on Manjaro

Step 1. Before installing any new software, it’s a good practice to update your package database. This ensures that you’re installing the latest version of the software and that all dependencies are up to date. To update the package database, run the following command in the terminal:

sudo pacman -Syu

Step 2. Installing Java.

Apache Spark is written in Scala, a language that runs on the Java Virtual Machine (JVM). Therefore, having the Java Development Kit (JDK) installed is a prerequisite for running Spark. Here’s how you can install OpenJDK on Manjaro:

sudo pacman -S jdk-openjdk

Verify the Java installation by checking the version:

java -version
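
If more than one JDK ends up installed, Manjaro’s archlinux-java helper (part of the java-runtime-common package) shows which Java environment is active and lets you switch the default. The environment name below is only an example; pick one listed by the status command:

archlinux-java status
sudo archlinux-java set java-21-openjdk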

Step 3. Installing Scala.

While not strictly required, it’s highly recommended to have Scala installed, as it’s the primary language used for writing Spark applications. Scala provides a more concise and expressive syntax compared to Java, making it easier to work with Spark. Install Scala using the package manager:

sudo pacman -S scala

Verify the Scala installation:

scala -version

Step 4. Installing Apache Spark on Manjaro.

The first step in installing Apache Spark is to download the appropriate distribution from the official Apache Spark website. You can choose between pre-built packages with or without bundled Hadoop libraries.

For this guide, we’ll download the package that bundles the Hadoop libraries, since the “without-hadoop” build expects an existing Hadoop installation on its classpath, which we won’t be setting up here. Use the following command to download Apache Spark 3.5.1:

wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
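
Optionally, confirm the download is intact by comparing it against the SHA-512 checksum Apache publishes alongside the archive:

wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz.sha512
sha512sum spark-3.5.1-bin-hadoop3.tgz

Compare the printed hash with the value in the downloaded .sha512 file before continuing.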

Once you’ve downloaded the Spark package, it’s time to extract and set up the necessary configurations:

sudo mkdir /opt/spark

Extract the downloaded package to the Spark home directory:

sudo tar -xvzf spark-3.5.1-bin-hadoop3.tgz -C /opt/spark --strip-components=1

Set up the environment variables by creating a new file /etc/profile.d/spark.sh with the following content:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
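
If you’d rather not open an editor, the file can also be created directly from the shell; the single quotes keep $PATH from being expanded when the file is written:

echo 'export SPARK_HOME=/opt/spark' | sudo tee /etc/profile.d/spark.sh
echo 'export PATH=$PATH:$SPARK_HOME/bin' | sudo tee -a /etc/profile.d/spark.sh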

Source the file to apply the changes:

source /etc/profile.d/spark.sh

Create or edit the spark-defaults.conf file in the $SPARK_HOME/conf directory to configure Spark’s settings. Here are some common configurations:

spark.driver.host                 localhost
spark.eventLog.enabled            true
spark.eventLog.dir                /tmp/spark-events
spark.history.fs.logDirectory     /tmp/spark-events
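
Spark ships a spark-defaults.conf.template that you can copy as a starting point for the file above, and the event log directory referenced in those settings has to exist before Spark writes to it:

sudo cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
mkdir -p /tmp/spark-events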

Verify the Spark installation by running the following command:

spark-shell

Step 5. Running Spark Shell.

The Spark Shell is an interactive environment that allows you to explore and experiment with Spark’s features. It’s a great way to get started with Spark and test your code snippets before integrating them into larger applications. To start the Spark Shell, simply run the following command:

spark-shell

Once the shell is up and running, you can start creating and manipulating Resilient Distributed Datasets (RDDs), which are the core data structures in Spark. Here’s a simple example:

val data = 1 to 1000000
val distData = sc.parallelize(data)
distData.filter(_ < 10).collect().foreach(println)

This code creates an RDD distData from a range of numbers, filters out values greater than or equal to 10, and prints the remaining values to the console.

Step 6. Running Spark Applications.

While the Spark Shell is great for testing and exploration, you’ll eventually want to develop and run full-fledged Spark applications. These applications can be written in Scala, Java, Python, or R, and can be packaged and submitted to a Spark cluster for execution. Here’s an example of a simple Scala Spark application:

import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Simple App")
      .getOrCreate()

    val data = Seq(("John", 30), ("Jane", 25), ("Bob", 35))
    val rdd = spark.sparkContext.parallelize(data)
    val result = rdd.map(x => (x._1, x._2 + 10))

    result.foreach(println)

    spark.stop()
  }
}

This application creates a SparkSession, parallelizes some data, performs a simple transformation (adding 10 to each person’s age), and prints the result.

To package and run this application, you can use build tools like sbt or Maven. Once packaged, you can submit the application to a Spark cluster using the spark-submit command:

spark-submit --class "SimpleApp" --master local[*] /path/to/your/app.jar
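
For reference, a minimal sbt build for a project like this could look as follows; the versions shown are illustrative, so match them to the Spark and Scala versions you installed:

name := "simple-app"
version := "0.1.0"
scalaVersion := "2.12.18"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1" % "provided"

The spark-sql dependency is marked provided because spark-submit supplies the Spark libraries at runtime; running sbt package then produces the JAR under target/scala-2.12/.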

Step 7. Installing PySpark.

PySpark is the Python API for Apache Spark, allowing you to write Spark applications using the Python programming language. It provides a seamless integration between Python and Spark, making it easier to leverage existing Python libraries and tools in your Spark workflows. To install PySpark on Manjaro, you can use the Python package manager, pip:

pip install pyspark
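
On recent Manjaro releases the system Python is marked as externally managed, so the command above may be refused outside a virtual environment. A minimal sketch of installing PySpark into a dedicated venv (the directory name is only an example):

python -m venv ~/spark-venv
source ~/spark-venv/bin/activate
pip install pyspark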

You may also want to install additional Python libraries that integrate well with PySpark, such as pandas, NumPy, and scikit-learn. Once installed, you can start the PySpark shell by running:

pyspark

Here’s a simple example of using PySpark to create an RDD and perform a transformation:

data = range(1, 1000001)
distData = sc.parallelize(data)
result = distData.filter(lambda x: x < 10).collect()
print(result)

Step 8. Spark Standalone Mode.

While running Spark applications locally is convenient for testing and development, you’ll likely want to deploy your applications to a Spark cluster for production workloads. Spark provides a Standalone mode for this purpose, which allows you to set up a Spark cluster on a set of dedicated machines.

To run Spark in Standalone mode, you’ll need to start a Spark Master process and one or more Spark Worker processes. Here’s how you can do it:

  1. Start the Spark Master process:
$SPARK_HOME/sbin/start-master.sh
  2. Start one or more Spark Worker processes, specifying the Master URL:
$SPARK_HOME/sbin/start-worker.sh spark://MASTER_HOST:7077

Replace MASTER_HOST with the hostname or IP address of the machine running the Spark Master.

  3. Configure your Spark applications to run in Standalone mode by specifying the --master option when submitting them:
spark-submit --master spark://MASTER_HOST:7077 --class "MyApp" /path/to/my-app.jar

This will submit your Spark application to the Standalone cluster, allowing it to take advantage of the distributed computing resources.
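
By default the Master also serves a web UI on port 8080, which you can open to confirm that your workers registered. When you’re finished, the matching scripts in the same sbin directory stop the processes again:

$SPARK_HOME/sbin/stop-worker.sh
$SPARK_HOME/sbin/stop-master.sh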

Congratulations! You have successfully installed Apache Spark. Thanks for using this tutorial to install the latest version of Apache Spark on the Manjaro system. For additional help or useful information, we recommend you check the official Apache Spark website.

