How To Install Apache Spark on Manjaro
In this tutorial, we will show you how to install Apache Spark on Manjaro. Apache Spark is a powerful open-source cluster computing framework designed for large-scale data processing. It has gained immense popularity in the big data ecosystem due to its speed, ease of use, and versatility. Whether you’re working with batch processing, real-time streaming, machine learning, or SQL workloads, Spark provides a unified platform to handle it all efficiently.
This article assumes you have at least basic knowledge of Linux, know how to use the shell, and most importantly, host your site on your own VPS. The installation is quite simple and assumes you are running in the root account; if not, you may need to add 'sudo' to the commands to get root privileges. I will show you the step-by-step installation of Apache Spark on Manjaro Linux.
Prerequisites
- A server or desktop running Manjaro or another Arch-based distribution.
- It’s recommended that you use a fresh OS install to prevent any potential issues.
- SSH access to the server (or just open Terminal if you’re on a desktop).
- A stable internet connection is crucial for downloading and installing packages. Verify your connection before proceeding.
- Access to a Manjaro Linux system with a non-root sudo user or root user.
Install Apache Spark on Manjaro
Step 1. Before installing any new software, it’s a good practice to update your package database. This ensures that you’re installing the latest version of the software and that all dependencies are up to date. To update the package database, run the following command in the terminal:
sudo pacman -Syu
Step 2. Installing Java.
Apache Spark is written in Scala, a language that runs on the Java Virtual Machine (JVM). Therefore, having the Java Development Kit (JDK) installed is a prerequisite for running Spark. Here’s how you can install OpenJDK on Manjaro:
sudo pacman -S jdk-openjdk
Verify the Java installation by checking the version:
java -version
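If you end up with more than one JDK on the system, Manjaro ships Arch's archlinux-java helper, which can list the installed Java environments and show which one is the default:
archlinux-java status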
Step 3. Installing Scala.
While not strictly required, it’s highly recommended to have Scala installed, as it’s the primary language used for writing Spark applications. Scala provides a more concise and expressive syntax compared to Java, making it easier to work with Spark. Install Scala using the package manager:
sudo pacman -S scala
Verify the Scala installation:
scala -version
Step 4. Installing Apache Spark on Manjaro.
The first step in installing Apache Spark is to download the appropriate distribution from the official Apache Spark website. You can choose between pre-built packages with or without bundled Hadoop dependencies.
For this guide, we'll download the package pre-built for Apache Hadoop 3, since it bundles the Hadoop client libraries Spark needs and works out of the box on a standalone machine (the "without Hadoop" build expects you to point Spark at an existing Hadoop installation). Use the following command to download Apache Spark 3.5.1:
wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
Once you’ve downloaded the Spark package, it’s time to extract and set up the necessary configurations:
sudo mkdir /opt/spark
Extract the downloaded package to the Spark home directory:
sudo tar -xvzf spark-3.5.1-bin-hadoop3.tgz -C /opt/spark --strip-components=1
Set up the environment variables by creating a new file /etc/profile.d/spark.sh with the following content:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
Source the file to apply the changes:
source /etc/profile.d/spark.sh
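You can quickly confirm that the variables are picked up; the second command should report version 3.5.1:
echo $SPARK_HOME
spark-submit --version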
Create or edit the spark-defaults.conf file in the $SPARK_HOME/conf directory to configure Spark's settings. Here are some common configurations:
spark.driver.host localhost
spark.eventLog.enabled true
spark.eventLog.dir /tmp/spark-events
spark.history.fs.logDirectory /tmp/spark-events
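If you enable event logging as above, make sure the log directory exists before launching Spark, as the driver may fail to start if it is missing:
mkdir -p /tmp/spark-events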
Verify the Spark installation by running the following command:
spark-shell
Step 5. Running Spark Shell.
The Spark Shell is an interactive environment that allows you to explore and experiment with Spark’s features. It’s a great way to get started with Spark and test your code snippets before integrating them into larger applications. To start the Spark Shell, simply run the following command:
spark-shell
Once the shell is up and running, you can start creating and manipulating Resilient Distributed Datasets (RDDs), which are the core data structures in Spark. Here’s a simple example:
val data = 1 to 1000000
val distData = sc.parallelize(data)
distData.filter(_ < 10).collect().foreach(println)
This code creates an RDD distData from a range of numbers, filters out values greater than or equal to 10, and prints the remaining values to the console.
Step 6. Running Spark Applications.
While the Spark Shell is great for testing and exploration, you’ll eventually want to develop and run full-fledged Spark applications. These applications can be written in Scala, Java, Python, or R, and can be packaged and submitted to a Spark cluster for execution. Here’s an example of a simple Scala Spark application:
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Simple App")
      .getOrCreate()

    val data = Seq(("John", 30), ("Jane", 25), ("Bob", 35))
    val rdd = spark.sparkContext.parallelize(data)
    val result = rdd.map(x => (x._1, x._2 + 10))
    result.foreach(println)

    spark.stop()
  }
}
This application creates a SparkSession, parallelizes some data, performs a simple transformation (adding 10 to each person’s age), and prints the result.
To package and run this application, you can use build tools like sbt or Maven. Once packaged, you can submit the application to a Spark cluster using the spark-submit command:
spark-submit --class "SimpleApp" --master local[*] /path/to/your/app.jar
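As a rough illustration, here is a minimal build.sbt sketch for packaging the application with sbt, assuming Spark 3.5.1 built for Scala 2.12 and the usual src/main/scala layout; the project and file names are just examples:
// build.sbt - minimal sketch for packaging SimpleApp with sbt
name := "simple-app"
version := "0.1.0"
scalaVersion := "2.12.18"
// "provided" because spark-submit supplies Spark's own jars on the classpath at runtime
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1" % "provided"
Running sbt package then produces a jar under target/scala-2.12/ that you can pass to spark-submit as shown above.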
Step 7. Installing PySpark.
PySpark is the Python API for Apache Spark, allowing you to write Spark applications using the Python programming language. It provides a seamless integration between Python and Spark, making it easier to leverage existing Python libraries and tools in your Spark workflows. To install PySpark on Manjaro, you can use the Python package manager, pip:
pip install pyspark
You may also want to install additional Python libraries that integrate well with PySpark, such as pandas, NumPy, and scikit-learn. Once installed, you can start the PySpark shell by running:
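On recent Manjaro releases, pip may refuse to install packages into the system Python because it is marked as externally managed; in that case, install PySpark and its companion libraries inside a virtual environment, for example:
python -m venv ~/pyspark-env
source ~/pyspark-env/bin/activate
pip install pyspark pandas numpy scikit-learn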
pyspark
Here’s a simple example of using PySpark to create an RDD and perform a transformation:
data = range(1, 1000001)
distData = sc.parallelize(data)
result = distData.filter(lambda x: x < 10).collect()
print(result)
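Note that the sc variable is only pre-defined inside the pyspark shell; a standalone script has to create its own session. Here is a minimal sketch (the file name simple_app.py is just an example) that you could run with spark-submit simple_app.py:
# simple_app.py -- minimal standalone PySpark sketch (hypothetical file name)
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Build (or reuse) a SparkSession; the pyspark shell normally does this for you
    spark = SparkSession.builder.appName("Simple PySpark App").getOrCreate()
    sc = spark.sparkContext

    # Same filter example as above, but in a self-contained script
    dist_data = sc.parallelize(range(1, 1000001))
    print(dist_data.filter(lambda x: x < 10).collect())

    spark.stop()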
Step 8. Spark Standalone Mode.
While running Spark applications locally is convenient for testing and development, you’ll likely want to deploy your applications to a Spark cluster for production workloads. Spark provides a Standalone mode for this purpose, which allows you to set up a Spark cluster on a set of dedicated machines.
To run Spark in Standalone mode, you’ll need to start a Spark Master process and one or more Spark Worker processes. Here’s how you can do it:
- Start the Spark Master process:
$SPARK_HOME/sbin/start-master.sh
- Start one or more Spark Worker processes, specifying the Master URL:
$SPARK_HOME/sbin/start-worker.sh spark://MASTER_HOST:7077
Replace MASTER_HOST with the hostname or IP address of the machine running the Spark Master.
- Configure your Spark applications to run in Standalone mode by specifying the --master option when submitting them:
spark-submit --master spark://MASTER_HOST:7077 --class "MyApp" /path/to/my-app.jar
This will submit your Spark application to the Standalone cluster, allowing it to take advantage of the distributed computing resources.
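You can confirm that the worker registered with the cluster by opening the Spark Master's web UI, which listens on port 8080 by default (http://MASTER_HOST:8080). When you are done, the standalone daemons can be stopped with the matching scripts in $SPARK_HOME/sbin:
$SPARK_HOME/sbin/stop-worker.sh
$SPARK_HOME/sbin/stop-master.sh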
Congratulations! You have successfully installed Apache Spark. Thanks for using this tutorial to install Apache Spark on your Manjaro system. For additional help or useful information, we recommend you check the official Apache Spark website.