
How To Install Apache Spark on Debian 12


In this tutorial, we will show you how to install Apache Spark on Debian 12. For those of you who didn’t know, Apache Spark has revolutionized big data processing, becoming the go-to solution for data engineers and analysts worldwide. Its lightning-fast processing speed and robust capabilities make it an essential tool for handling vast amounts of data efficiently. If you’re looking to harness the power of Apache Spark on your Debian 12 (Bookworm) system, you’ve come to the right place.

This article assumes you have at least basic knowledge of Linux, know how to use the shell, and, most importantly, host your site on your own VPS. The installation is quite simple and assumes you are running in the root account; if not, you may need to add 'sudo' to the commands to get root privileges. I will show you the step-by-step installation of Apache Spark on Debian 12 (Bookworm).

Prerequisites

  • A server running Debian 12 (Bookworm).
  • It’s recommended that you use a fresh OS install to prevent any potential issues.
  • SSH access to the server (or just open Terminal if you’re on a desktop).
  • An active internet connection. You’ll need an internet connection to download the necessary packages and dependencies for Apache Spark.
  • A non-root sudo user or access to the root user. We recommend acting as a non-root sudo user, however, as you can harm your system if you’re not careful when acting as root.

Install Apache Spark on Debian 12 Bookworm

Step 1. Before we install any software, it’s important to make sure your system is up to date by running the following apt commands in the terminal:

sudo apt update
sudo apt upgrade

These commands refresh the package index and upgrade installed packages, allowing you to install the latest versions of the software you need.

Step 2. Installing Java.

Now, let’s install the required dependencies, including OpenJDK (Java Development Kit), which is a prerequisite for Apache Spark:

sudo apt install default-jdk

Confirm that Java is installed correctly by checking its version:

java -version
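
The troubleshooting section below refers to the JAVA_HOME environment variable, which some tools expect to be set. Optionally, set it now; on Debian, the default-jdk package provides a /usr/lib/jvm/default-java symlink, though the exact path may differ on your system:

echo 'export JAVA_HOME=/usr/lib/jvm/default-java' >> ~/.bashrc
source ~/.bashrc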

Step 3. Installing Apache Spark on Debian 12.

With your system prepared, it’s time to download Apache Spark (version 3.4.1 at the time of writing) and set up the foundation for your big data journey:

wget https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
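
Optionally, verify the integrity of the archive before unpacking by computing its SHA-512 hash and comparing it with the value published alongside the download on the Apache site:

sha512sum spark-3.4.1-bin-hadoop3.tgz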

Unpack the downloaded package using the following command:

tar xvf spark-3.4.1-bin-hadoop3.tgz

Next, move the extracted Spark directory to a location of your choice. For example, let’s move it to the /opt/ directory:

sudo mv spark-3.4.1-bin-hadoop3 /opt/spark

To access Spark commands from anywhere on your system, we need to set up some environment variables. Open the .bashrc file using a text editor:

nano ~/.bashrc

Add the following lines at the end of the file to set the required environment variables (the sbin directory is included so that the cluster scripts used later, such as start-master.sh, are also on your PATH):

# Apache Spark
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Save the changes and exit the text editor. To apply the changes to your current session, run:

source ~/.bashrc

Ensure that Apache Spark is correctly installed by running a simple test:

spark-shell

If the installation was successful, the Spark shell should launch, and you’ll see a prompt like this:

scala>
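
To exit the Spark shell at any time, type:

:quit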
Step 4. Running Apache Spark Locally.

With Apache Spark installed and configured, let’s run it locally to process some data. Run the following command to start the Spark shell:

spark-shell

Basic DataFrame Operations:

Let’s perform some basic operations using Spark’s DataFrame API:

  • Reading a CSV File:

To read a CSV file into a DataFrame, use the following code snippet (the trailing dots let the Spark shell treat the chained calls as a single expression):

val data = spark.read.format("csv").
  option("header", "true").
  load("/path/to/your/csv/file.csv")
  • Displaying Data:

To display the DataFrame content, simply type the variable name and hit Enter:

data.show()

  • Performing Operations:

You can perform various transformations on the DataFrame, such as filtering, grouping, and aggregating, using Spark’s functional programming APIs.
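
For instance, here is a minimal sketch of filtering and grouping, assuming your CSV has the "price" column used below plus a hypothetical "category" column:

val expensive = data.filter($"price" > 100)   // keep rows with price above 100
expensive.groupBy("category").count().show()  // row counts per (hypothetical) category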

Example: Let’s calculate the average value of a column named “price”:

import org.apache.spark.sql.functions.avg

val avgPrice = data.agg(avg("price")).collect()(0)(0)
println(s"The average price is: $avgPrice")

Step 5. Setting up a Spark Cluster (Optional).

While Spark can be run locally, its true power shines when deployed on a cluster. Setting up a Spark cluster allows you to distribute data processing tasks across multiple nodes, significantly improving performance and scalability.

  1. Preparing Nodes: Ensure all nodes in your cluster have Java and Spark installed with the same version. Copy the Spark installation directory to each node (see the rsync sketch after this list).

  2. Configuring Spark on the Master Node: On the master node, navigate to the Spark configuration directory:

cd /opt/spark/conf

Copy the spark-env.sh.template file to spark-env.sh:

cp spark-env.sh.template spark-env.sh

Edit the spark-env.sh file to configure the master node and other settings:

nano spark-env.sh

Add the following lines to specify the IP address and port of the master node and to allocate memory for each Spark worker:

export SPARK_MASTER_HOST=<master-ip>
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_MEMORY=2g

Save the changes and exit the text editor.
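
To copy the Spark installation to each node (item 1 above), one option is rsync over SSH. A minimal sketch, assuming a hypothetical worker hostname of worker1 and a user with write access to /opt:

rsync -a /opt/spark/ user@worker1:/opt/spark/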

Step 6. Launching the Master Node.

Start the Spark master node by running the following command:

start-master.sh

Access the Spark web UI by opening a web browser and navigating to:

http://<master-ip>:8080
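
To actually form a cluster, start a worker process on each worker node and point it at the master. In Spark 3.x the script is named start-worker.sh:

/opt/spark/sbin/start-worker.sh spark://<master-ip>:7077

Once a worker connects, it should appear in the Workers section of the master’s web UI.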

Step 7. Troubleshooting Tips.

You may encounter some challenges while installing and configuring Apache Spark. Here are some common issues and troubleshooting tips:

  1. Java Version Conflict: If you encounter Java version issues, ensure that you have installed a version of OpenJDK supported by your Spark release (Java 8, 11, or 17 for Spark 3.4) and set the JAVA_HOME environment variable correctly.
  2. Spark Shell Failure: If the Spark shell fails to launch, check the environment variables, and ensure Spark’s installation directory is correctly set in your system’s PATH.
  3. Port Conflicts: If the Spark web UI doesn’t load or shows errors related to port conflicts, verify that the specified ports (e.g., 8080, 7077) are not in use by other services on your system (see the check below).
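
You can check whether those ports are already in use with ss (part of the iproute2 package on Debian):

ss -tlnp | grep -E ':(8080|7077)'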

Congratulations! You have successfully installed Apache Spark. Thanks for using this tutorial to install Apache Spark on Debian 12 Bookworm. For additional help or useful information, we recommend you check the official Apache website.
