How To Install Apache Spark on Ubuntu 24.04 LTS
Apache Spark stands at the forefront of big data analytics, offering unparalleled speed and versatility. It’s an open-source, distributed computing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark’s ability to process data in-memory makes it significantly faster than traditional big data tools, especially for iterative algorithms and interactive data analysis.
Ubuntu 24.04 LTS, the latest long-term support release from Canonical, provides a stable and secure foundation for running Apache Spark. This version boasts improved performance, enhanced security features, and long-term support, making it an ideal choice for deploying Spark in production environments.
In this comprehensive guide, we’ll walk through the step-by-step process of installing Apache Spark on Ubuntu 24.04 LTS. Whether you’re setting up a development environment or preparing for a large-scale deployment, this tutorial will equip you with the knowledge to get Spark up and running smoothly.
Prerequisites
Before diving into the installation process, ensure your system meets the following requirements:
System Requirements
- A machine running Ubuntu 24.04 LTS (server or desktop edition)
- At least 4GB of RAM (8GB or more recommended for optimal performance)
- Minimum 10GB of free disk space
- An active internet connection for downloading packages
Software Prerequisites
- Administrative privileges (sudo access)
- Basic familiarity with the Linux command line
Additionally, you’ll need to install the following packages and tools:
- Java Development Kit (JDK)
- Curl or wget for downloading files
- Tar for extracting compressed files
Most of these tools come pre-installed on Ubuntu, but we’ll cover their installation in the following steps to ensure everything is set up correctly.
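If you want to confirm these helper tools are present before you begin (package names assumed from the standard Ubuntu repositories), you can install them all in one step; anything already installed is simply skipped:
sudo apt install -y wget curl tar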
Step 1: Update System Packages
Before installing any new software, it’s crucial to update your system’s package list and upgrade existing packages. This ensures you have the latest security updates and bug fixes.
Open a terminal and run the following command:
sudo apt update && sudo apt -y upgrade
This command updates the package list and upgrades all installed packages to their latest versions. The -y flag automatically answers “yes” to any prompts, streamlining the upgrade process.
Step 2: Install Java Development Kit (JDK)
Apache Spark requires Java to run. Spark 3.5.x officially supports Java 8, 11, and 17; Java 17 is the most recent of these and is the recommended choice for compatibility and performance.
To install OpenJDK 17 from the Ubuntu 24.04 LTS repositories, run:
sudo apt install openjdk-17-jdk -y
After the installation completes, verify the Java version by running:
java -version
You should see output similar to this:
openjdk version "17.0.XX" 20XX-XX-XX
OpenJDK Runtime Environment (build 17.0.XX+XX-Ubuntu-XubuntuX.XX.XX)
OpenJDK 64-Bit Server VM (build 17.0.XX+XX-Ubuntu-XubuntuX.XX.XX, mixed mode, sharing)
If you see this output, Java is successfully installed on your system.
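Some Spark tooling and third-party integrations also look for a JAVA_HOME environment variable. Setting it is optional for this guide, but if you want it, the path below is the typical install location for the Ubuntu OpenJDK 17 package on amd64 (verify it on your system with readlink -f $(which java)) and can be added to your ~/.bashrc:
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64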
Step 3: Download Apache Spark
Now that Java is installed, we can proceed to download Apache Spark. Visit the official Apache Spark website to find the latest stable version. At the time of writing, the latest version is 3.5.2, but you should check for the most recent release.
To download Spark, use the wget command:
wget https://dlcdn.apache.org/spark/spark-3.5.2/spark-3.5.2-bin-hadoop3.tgz
Replace 3.5.2 with the latest version number if a newer release is available. This command downloads the Spark binary package pre-built for Apache Hadoop 3.x.
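It is good practice to verify the integrity of the download. Apache publishes a SHA-512 checksum alongside each release archive (the URL below assumes the same version and mirror layout as the download above); fetch it, compute the local checksum, and compare the two values manually:
wget https://dlcdn.apache.org/spark/spark-3.5.2/spark-3.5.2-bin-hadoop3.tgz.sha512
sha512sum spark-3.5.2-bin-hadoop3.tgz
cat spark-3.5.2-bin-hadoop3.tgz.sha512
The computed hash should match the published one (ignoring whitespace differences in the formatting). If it does not, re-download the archive before continuing.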
Step 4: Extract and Move Apache Spark
Once the download is complete, extract the downloaded file using the tar command:
tar xvf spark-3.5.2-bin-hadoop3.tgz
This command creates a new directory named spark-3.5.2-bin-hadoop3 in your current location.
Next, move the extracted directory to /opt/spark for easier management and access:
sudo mv spark-3.5.2-bin-hadoop3 /opt/spark
This step centralizes the Spark installation and makes it accessible to all users on the system.
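An optional variation, useful if you expect to upgrade Spark later, is to keep the versioned directory and point a symbolic link at it; future upgrades then only require repointing the link. These two commands replace the mv command above rather than adding to it:
sudo mv spark-3.5.2-bin-hadoop3 /opt/spark-3.5.2
sudo ln -s /opt/spark-3.5.2 /opt/spark
Everything else in this guide works the same either way, since /opt/spark still resolves to the installation.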
Step 5: Configure Environment Variables
To use Spark from any location on your system, you need to set up environment variables. Edit the .bashrc file in your home directory:
nano ~/.bashrc
Add the following lines at the end of the file:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
These lines set the SPARK_HOME variable and add Spark’s binary directories to your system’s PATH.
Save the file and exit the editor (in nano, press Ctrl+X, then Y, then Enter). To apply these changes to your current session, run:
source ~/.bashrc
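You can confirm the variables took effect with a quick check:
echo $SPARK_HOME
which spark-shell
The first command should print /opt/spark and the second should print /opt/spark/bin/spark-shell.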
Step 6: Start Apache Spark
With the environment variables set, you can now start the Spark master node. Run the following command:
$SPARK_HOME/sbin/start-master.sh
This script starts the Spark master process. You should see output indicating that the master has started successfully.
To access the Spark web UI, open a web browser and navigate to:
http://localhost:8080
If you’re accessing Spark from a remote machine, replace localhost with your server’s IP address or domain name.
The Spark UI provides valuable information about your Spark cluster, including worker nodes, running applications, and resource usage.
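The master by itself has no executors to run work on. To attach a worker to it on the same machine, use the start-worker.sh script that ships with Spark, pointing it at the master URL shown at the top of the web UI (spark://<hostname>:7077 by default; replace <hostname> with the value the UI displays):
$SPARK_HOME/sbin/start-worker.sh spark://<hostname>:7077
After a few seconds the worker should appear in the Workers table of the web UI.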
Step 7: Verify Installation
To ensure Spark is installed correctly and functioning as expected, start the Spark shell:
spark-shell
You should see a welcome message and a Scala prompt. This interactive shell allows you to run Spark commands and test your installation.
Try a simple Spark operation to verify everything is working:
val data = spark.range(1, 1000)
data.count()
This creates a dataset of the numbers from 1 to 999 and counts them. If you see the result 999, congratulations! Your Spark installation is working correctly.
Exit the Spark shell by typing :quit or pressing Ctrl+D.
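For an additional end-to-end check, you can submit one of the example applications bundled with Spark to the master you started in Step 6. The exact jar file name depends on the Scala version of your build, so check the $SPARK_HOME/examples/jars directory; the name below assumes the default Scala 2.12 build of Spark 3.5.2, and <hostname> is again a placeholder for your master’s address:
spark-submit --class org.apache.spark.examples.SparkPi --master spark://<hostname>:7077 $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.2.jar 10
The job should finish with a line like “Pi is roughly 3.14...”, confirming that applications can run against the cluster.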
Step 8: Configure Systemd Service (Optional)
For production environments, it’s beneficial to set up Spark as a system service. This ensures Spark starts automatically on boot and can be easily managed using systemd commands.
Create a new systemd service file:
sudo nano /etc/systemd/system/spark-master.service
Add the following content to the file:
[Unit]
Description=Apache Spark Master
After=network.target
[Service]
Type=forking
User=ubuntu
Group=ubuntu
ExecStart=/opt/spark/sbin/start-master.sh
ExecStop=/opt/spark/sbin/stop-master.sh
Restart=on-failure
[Install]
WantedBy=multi-user.target
Replace ubuntu with your system username if different.
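If you prefer not to run Spark under your login account, a common optional hardening step is to create a dedicated system user and give it ownership of the installation (the account name spark below is just an example):
sudo useradd -r -s /usr/sbin/nologin spark
sudo chown -R spark:spark /opt/spark
If you do this, set User=spark and Group=spark in the service file instead.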
Save the file and exit the editor. Then, reload the systemd daemon to recognize the new service:
sudo systemctl daemon-reload
Now you can start and enable the Spark master service:
sudo systemctl start spark-master
sudo systemctl enable spark-master
These commands start the Spark master immediately and configure it to start automatically on system boot.
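You can verify the service at any time with:
sudo systemctl status spark-master
The output should show the service as active (running), along with recent log lines from the master process.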
Troubleshooting Common Issues
Even with careful installation, you might encounter some issues. Here are solutions to common problems:
Java Not Found
If you receive a “Java not found” error, ensure Java is installed correctly:
java -version
If Java isn’t recognized, revisit Step 2 and reinstall the JDK.
Spark Commands Not Recognized
If Spark commands aren’t recognized, check your PATH settings:
echo $PATH
Ensure /opt/spark/bin and /opt/spark/sbin are in the output. If not, review Step 5 and update your .bashrc file.
Port Conflicts
If Spark fails to start due to port conflicts, you can change the default ports by setting environment variables in $SPARK_HOME/conf/spark-env.sh (copy spark-env.sh.template to create the file if it does not exist yet):
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8080
Adjust these values if other services are already using the default ports.
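To find out which process is actually holding a port before changing Spark’s configuration, you can use the ss utility that ships with Ubuntu:
sudo ss -ltnp | grep 8080
This lists the listening TCP socket on port 8080 together with the owning process, which makes it easier to decide whether to move Spark or the other service.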
Congratulations! You have successfully installed Apache Spark. Thanks for using this tutorial for installing Apache Spark on your Ubuntu system. For additional and up-to-date information, we recommend you check the official Apache Spark website.