
How To Install Apache Spark on Ubuntu 24.04 LTS


Apache Spark stands at the forefront of big data analytics, offering unparalleled speed and versatility. It’s an open-source, distributed computing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark’s ability to process data in-memory makes it significantly faster than traditional big data tools, especially for iterative algorithms and interactive data analysis.

Ubuntu 24.04 LTS, the latest long-term support release from Canonical, provides a stable and secure foundation for running Apache Spark. This version boasts improved performance, enhanced security features, and long-term support, making it an ideal choice for deploying Spark in production environments.

In this comprehensive guide, we’ll walk through the step-by-step process of installing Apache Spark on Ubuntu 24.04 LTS. Whether you’re setting up a development environment or preparing for a large-scale deployment, this tutorial will equip you with the knowledge to get Spark up and running smoothly.

Prerequisites

Before diving into the installation process, ensure your system meets the following requirements:

System Requirements

  • A machine running Ubuntu 24.04 LTS (server or desktop edition)
  • At least 4GB of RAM (8GB or more recommended for optimal performance)
  • Minimum 10GB of free disk space
  • An active internet connection for downloading packages

Software Prerequisites

  • Administrative privileges (sudo access)
  • Basic familiarity with the Linux command line

Additionally, you’ll need to install the following packages and tools:

  • Java Development Kit (JDK)
  • Curl or wget for downloading files
  • Tar for extracting compressed files

Most of these tools come pre-installed on Ubuntu, but we’ll cover their installation in the following steps to ensure everything is set up correctly.

Step 1: Update System Packages

Before installing any new software, it’s crucial to update your system’s package list and upgrade existing packages. This ensures you have the latest security updates and bug fixes.

Open a terminal and run the following command:

sudo apt update && sudo apt -y upgrade

This command updates the package list and upgrades all installed packages to their latest versions. The -y flag automatically answers “yes” to any prompts, streamlining the upgrade process.

Step 2: Install Java Development Kit (JDK)

Apache Spark requires Java to run. Spark 3.5.x officially supports Java 8, 11, and 17, so Java 17 is the recommended choice on Ubuntu 24.04. Note that the default-jdk package on Ubuntu 24.04 pulls in OpenJDK 21, which Spark 3.5 does not officially support, so install OpenJDK 17 explicitly:

sudo apt install openjdk-17-jdk -y

After the installation completes, verify the Java version by running:

java -version

You should see output similar to this:

openjdk version "11.0.XX" 20XX-XX-XX
OpenJDK Runtime Environment (build 11.0.XX+XX-Ubuntu-XXubuntuX.XX.XX)
OpenJDK 64-Bit Server VM (build 11.0.XX+XX-Ubuntu-XXubuntuX.XX.XX, mixed mode, sharing)

If you see this output, Java is successfully installed on your system.
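
Some tools and Spark configuration files also look for the JAVA_HOME environment variable. Setting it is optional for this guide, but if you want it in place, the following snippet resolves the path of the active java binary and appends the export to your ~/.bashrc (a generic approach that works for any installed JDK):

echo "export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))" >> ~/.bashrc
source ~/.bashrc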

Step 3: Download Apache Spark

Now that Java is installed, we can proceed to download Apache Spark. Visit the official Apache Spark website to find the latest stable version. At the time of writing, the latest version is 3.5.2, but you should check for the most recent release.

To download Spark, use the wget command:

wget https://dlcdn.apache.org/spark/spark-3.5.2/spark-3.5.2-bin-hadoop3.tgz

Replace 3.5.2 with the latest version number if a newer release is available. This command downloads the Spark binary package pre-built for Apache Hadoop 3.x.
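
Before extracting, you can optionally verify the download’s integrity. Apache publishes a SHA-512 checksum alongside each release; assuming the checksum file uses the standard sha512sum format, the check looks like this:

wget https://downloads.apache.org/spark/spark-3.5.2/spark-3.5.2-bin-hadoop3.tgz.sha512
sha512sum -c spark-3.5.2-bin-hadoop3.tgz.sha512

The command should report OK if the file is intact.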

Step 4: Extract and Move Apache Spark

Once the download is complete, extract the downloaded file using the tar command:

tar xvf spark-3.5.2-bin-hadoop3.tgz

This command will create a new directory named spark-3.5.2-bin-hadoop3 in your current location.

Next, move the extracted directory to /opt/spark for easier management and access:

sudo mv spark-3.5.2-bin-hadoop3 /opt/spark

This step centralizes the Spark installation and makes it accessible to all users on the system.
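
As an optional alternative to the single mv above, you can keep the version number in the directory name and point a symlink at it, which makes future upgrades a matter of repointing the link (a common convention, not a requirement of this guide):

sudo mv spark-3.5.2-bin-hadoop3 /opt/spark-3.5.2
sudo ln -s /opt/spark-3.5.2 /opt/spark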

Step 5: Configure Environment Variables

To use Spark from any location on your system, you need to set up environment variables. Edit the .bashrc file in your home directory:

nano ~/.bashrc

Add the following lines at the end of the file:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

These lines set the SPARK_HOME variable and add Spark’s binary directories to your system’s PATH.

Save the file and exit the editor (in nano, press Ctrl+X, then Y, then Enter). To apply these changes to your current session, run:

source ~/.bashrc
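
You can quickly confirm that the variables took effect:

echo $SPARK_HOME
which spark-shell

The first command should print /opt/spark and the second /opt/spark/bin/spark-shell.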

Step 6: Start Apache Spark

With the environment variables set, you can now start the Spark master node. Run the following command:

$SPARK_HOME/sbin/start-master.sh

This script starts the Spark master process and prints the path of the log file it writes to. You should see output indicating that the master has started successfully.

To access the Spark web UI, open a web browser and navigate to:

http://localhost:8080

If you’re accessing Spark from a remote machine, replace localhost with your server’s IP address or domain name.

The Spark UI provides valuable information about your Spark cluster, including worker nodes, running applications, and resource usage.
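
A master on its own has no executors to offer. To see a worker appear in the UI, you can start one on the same machine and point it at the master URL shown at the top of the page (spark://<hostname>:7077 with the default port; substitute your actual master URL):

$SPARK_HOME/sbin/start-worker.sh spark://$(hostname):7077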

Step 7: Verify Installation

To ensure Spark is installed correctly and functioning as expected, start the Spark shell:

spark-shell

You should see a welcome message and a Scala prompt. This interactive shell allows you to run Spark commands and test your installation.

Try a simple Spark operation to verify everything is working:

val data = spark.range(1, 1000)
data.count()

This creates a dataset of numbers from 1 to 999 and counts them. If you see the result 999, congratulations! Your Spark installation is working correctly.

Exit the Spark shell by typing :quit or pressing Ctrl+D.
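
You can also verify batch submission with spark-submit, using the SparkPi example that ships with Spark. The jar path below matches the 3.5.2 binary distribution; adjust the file name if your version differs:

spark-submit --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.2.jar 100

When the job finishes, look for a line like “Pi is roughly 3.14...” in the output.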

Step 8: Configure Systemd Service (Optional)

For production environments, it’s beneficial to set up Spark as a system service. This ensures Spark starts automatically on boot and can be easily managed using systemd commands.

Create a new systemd service file:

sudo nano /etc/systemd/system/spark-master.service

Add the following content to the file:

[Unit]
Description=Apache Spark Master
After=network.target

[Service]
Type=forking
User=ubuntu
Group=ubuntu
ExecStart=/opt/spark/sbin/start-master.sh
ExecStop=/opt/spark/sbin/stop-master.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target

Replace ubuntu with your system username if different.

Save the file and exit the editor. Then, reload the systemd daemon to recognize the new service:

sudo systemctl daemon-reload

Now you can start and enable the Spark master service:

sudo systemctl start spark-master
sudo systemctl enable spark-master

These commands start the Spark master immediately and configure it to start automatically on system boot.
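
You can confirm the service is running and inspect its recent log output with:

sudo systemctl status spark-master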

Troubleshooting Common Issues

Even with careful installation, you might encounter some issues. Here are solutions to common problems:

Java Not Found

If you receive a “Java not found” error, ensure Java is installed correctly:

java -version

If Java isn’t recognized, revisit Step 2 and reinstall the JDK.
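
If multiple JDKs are installed and the wrong one is active, you can inspect and switch the system-wide default with update-alternatives:

sudo update-alternatives --config java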

Spark Commands Not Recognized

If Spark commands aren’t recognized, check your PATH settings:

echo $PATH

Ensure /opt/spark/bin and /opt/spark/sbin are in the output. If not, review Step 5 and update your .bashrc file.
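
You can also confirm that the export lines were actually saved to your shell profile:

grep SPARK_HOME ~/.bashrc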

Port Conflicts

If Spark fails to start due to port conflicts, you can change the master’s default ports. These are controlled by environment variables in $SPARK_HOME/conf/spark-env.sh (copy spark-env.sh.template to spark-env.sh if the file doesn’t exist yet):

SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8080

Adjust these values if other services are already using the defaults.
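
To find out which process currently holds a port before changing Spark’s configuration, you can check with ss:

sudo ss -ltnp | grep -E ':(7077|8080)'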

Congratulations! You have successfully installed Apache Spark. Thanks for using this tutorial to install Apache Spark on your Ubuntu 24.04 LTS system. For additional information, we recommend checking the official Apache Spark documentation.

