How To Install Apache Spark on CentOS Stream 10
In this tutorial, we will show you how to install Apache Spark on CentOS Stream 10. Apache Spark has become an indispensable tool in the world of big data processing and analytics. As organizations continue to grapple with ever-increasing volumes of data, the need for efficient and scalable data processing solutions has never been greater. In this comprehensive guide, we’ll walk you through the process of installing Apache Spark on CentOS Stream 10, providing you with the knowledge and tools you need to harness the power of this cutting-edge technology.
Introduction
Apache Spark is an open-source, distributed computing system designed for large-scale data processing and analytics. It offers significant advantages over traditional data processing frameworks, including in-memory data storage and computation capabilities, which result in much faster processing times. Whether you’re dealing with batch processing, interactive queries, streaming data, or machine learning tasks, Spark provides a unified platform to handle it all.
CentOS Stream 10, the latest enterprise Linux distribution, offers a stable and robust environment for running Apache Spark. By combining the power of Spark with the reliability of CentOS Stream 10, you’ll be well-equipped to tackle even the most demanding data processing challenges.
This tutorial is aimed at data engineers, system administrators, and developers who want to set up a Spark environment on CentOS Stream 10. We’ll cover everything from system requirements to advanced configuration options, ensuring you have a fully functional Spark installation by the end of this guide.
System Requirements
Before we dive into the installation process, let’s review the system requirements for running Apache Spark on CentOS Stream 10.
Hardware Requirements
To run Apache Spark effectively, your system should meet or exceed the following specifications:
- CPU: x86-64-v3 capable processor (the minimum architecture level CentOS Stream 10 supports); multiple cores recommended for parallel workloads
- RAM: Minimum 8GB, recommended 16GB or more for larger datasets
- Storage: At least 10GB of free disk space for Spark and its dependencies
Software Prerequisites
Ensure your CentOS Stream 10 system has the following software components:
- CentOS Stream 10 base installation
- Internet connectivity for downloading packages
- Root or sudo privileges for installation and configuration
Pre-Installation Setup
Before we install Apache Spark, we need to prepare our CentOS Stream 10 system by updating packages and installing necessary dependencies.
System Updates
First, let’s update the system packages to ensure we have the latest security patches and software versions:
sudo dnf update -y
Next, install the development tools, which will be necessary for compiling certain dependencies:
sudo dnf groupinstall "Development Tools" -y
Java Installation
Apache Spark requires Java to run. We’ll install OpenJDK 21, which is compatible with the latest versions of Spark:
sudo dnf install java-21-openjdk-devel -y
After installation, set up the Java environment variables by adding the following lines to your ~/.bashrc file:
echo 'export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which javac))))' >> ~/.bashrc
echo 'export PATH=$PATH:$JAVA_HOME/bin' >> ~/.bashrc
source ~/.bashrc
Verify the Java installation by running:
java -version
You should see output indicating the installed Java version.
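As an additional sanity check, the following commands should print the JDK installation directory and the compiler version; the exact path will vary with the OpenJDK build installed on your system:
echo $JAVA_HOME
javac -version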
Apache Spark Installation Process
Now that we have our system prepared, let’s proceed with the Apache Spark installation.
Download and Extract
First, we’ll download the latest stable version of Apache Spark. At the time of writing, this is version 3.5.4, but you should check the official Apache Spark website for the most recent release.
wget https://downloads.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
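Optionally, you can verify the integrity of the download before extracting it. Apache publishes a .sha512 checksum file alongside each release; download it and compare its contents with the hash computed locally (checksum file formats have varied between releases, so a visual comparison is the safest approach):
wget https://downloads.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz.sha512
cat spark-3.5.4-bin-hadoop3.tgz.sha512
sha512sum spark-3.5.4-bin-hadoop3.tgz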
Once the download is complete, extract the archive:
tar xvf spark-3.5.4-bin-hadoop3.tgz
Move the extracted directory to a more suitable location:
sudo mv spark-3.5.4-bin-hadoop3 /opt/spark
Environment Configuration
To make Spark accessible system-wide, we need to set up some environment variables. Add the following lines to your ~/.bashrc file:
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin' >> ~/.bashrc
source ~/.bashrc
Verification Steps
To verify that Spark has been installed correctly, run the following command:
spark-shell
If the installation was successful, you should see the Spark shell starting up with the Spark logo and version information displayed.
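For a quick non-interactive test, you can also run one of the example programs bundled with Spark. The run-example helper in the Spark bin directory computes an approximation of Pi; a line similar to "Pi is roughly 3.14..." in the output confirms that Spark jobs execute correctly:
/opt/spark/bin/run-example SparkPi 10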
Advanced Configuration
Now that we have a basic Spark installation up and running, let’s explore some advanced configuration options to optimize performance and set up a cluster.
Cluster Setup
For a standalone Spark cluster, you’ll need to configure the master and worker nodes. Create the /opt/spark/conf/spark-env.sh file (Spark ships a spark-env.sh.template in the same directory that you can copy) and add the following lines:
export SPARK_MASTER_HOST=<master-ip>
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=8g
Replace <master-ip> with the IP address of your master node.
To start the Spark master:
/opt/spark/sbin/start-master.sh
On each worker node, start the Spark worker (in Spark 3.x, start-worker.sh replaces the older start-slave.sh script):
/opt/spark/sbin/start-worker.sh spark://<master-ip>:7077
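With the master and at least one worker running, you can confirm the cluster accepts applications by submitting the bundled SparkPi example against the master URL. Replace <master-ip> with your master’s address and adjust the examples jar name if your Spark or Scala version differs:
/opt/spark/bin/spark-submit \
  --master spark://<master-ip>:7077 \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark/examples/jars/spark-examples_2.12-3.5.4.jar 100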
Performance Optimization
To optimize Spark performance, consider adjusting the following settings in /opt/spark/conf/spark-defaults.conf (copy it from spark-defaults.conf.template if the file does not exist yet):
spark.executor.memory 4g
spark.executor.cores 2
spark.driver.memory 4g
spark.driver.cores 2
spark.default.parallelism 8
These settings should be adjusted based on your specific hardware and workload requirements.
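Values in spark-defaults.conf can also be overridden per application at submit time, which is a convenient way to experiment before committing changes to the file. The resource values and the application jar below are only illustrative:
spark-submit \
  --conf spark.executor.memory=6g \
  --conf spark.executor.cores=3 \
  --conf spark.default.parallelism=12 \
  your-application.jar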
Security Configuration
Securing your Spark installation is crucial, especially in a production environment. Here are some key security measures to implement:
Firewall Settings
Configure your firewall to allow traffic on the necessary Spark ports. For a basic setup, you’ll need to open ports 7077 (for Spark master) and 8080 (for the web UI):
sudo firewall-cmd --permanent --add-port=7077/tcp
sudo firewall-cmd --permanent --add-port=8080/tcp
sudo firewall-cmd --reload
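You can confirm the rules were applied with:
sudo firewall-cmd --list-ports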
Authentication Setup
To enable authentication for Spark, edit the spark-defaults.conf file and add:
spark.authenticate true
spark.authenticate.secret <your-secret-key>
Replace <your-secret-key> with a strong, unique secret key.
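A convenient way to generate a strong secret is the openssl utility, which is available on CentOS Stream 10; the command below prints a random 64-character hexadecimal string you can paste into spark-defaults.conf:
openssl rand -hex 32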
SSL/TLS Configuration
For encrypted communication, set up SSL/TLS by generating a keystore and adding the following to spark-defaults.conf:
spark.ssl.enabled true
spark.ssl.keyStore /path/to/keystore
spark.ssl.keyStorePassword <keystore-password>
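A keystore can be generated with the keytool utility that ships with the JDK. The alias, distinguished name, validity period, path, and password below are placeholders; choose your own values and keep them consistent with spark-defaults.conf:
keytool -genkeypair \
  -alias spark \
  -dname "CN=spark" \
  -keyalg RSA -keysize 2048 \
  -validity 365 \
  -keystore /path/to/keystore \
  -storepass <keystore-password>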
Integration with CentOS Stream 10 Features
CentOS Stream 10 offers several features that can enhance your Spark installation:
Systemd Integration
Create a systemd service file for Spark to manage it more easily:
sudo nano /etc/systemd/system/spark-master.service
Add the following content:
[Unit]
Description=Apache Spark Master
After=network.target

[Service]
Type=forking
ExecStart=/opt/spark/sbin/start-master.sh
ExecStop=/opt/spark/sbin/stop-master.sh
User=spark

[Install]
WantedBy=multi-user.target
This unit runs Spark as a dedicated spark user; create that account first (for example, sudo useradd -r -d /opt/spark spark followed by sudo chown -R spark:spark /opt/spark) or change User= to the account that owns /opt/spark. Then enable and start the service:
sudo systemctl enable spark-master
sudo systemctl start spark-master
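On each worker node you can create a similar unit, for example /etc/systemd/system/spark-worker.service; the master URL below is a placeholder that must match your actual master address:
[Unit]
Description=Apache Spark Worker
After=network.target

[Service]
Type=forking
ExecStart=/opt/spark/sbin/start-worker.sh spark://<master-ip>:7077
ExecStop=/opt/spark/sbin/stop-worker.sh
User=spark

[Install]
WantedBy=multi-user.target
Enable it with sudo systemctl enable spark-worker and start it with sudo systemctl start spark-worker.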
SELinux Configuration
If you have SELinux enabled, you may need to create a custom policy for Spark. Start by creating a policy file:
sudo nano /tmp/spark.te
Add the following content:
module spark 1.0;

require {
    type unconfined_t;
    type spark_port_t;
    class tcp_socket name_connect;
}

#============= unconfined_t ==============
allow unconfined_t spark_port_t:tcp_socket name_connect;
Compile and load the policy:
checkmodule -M -m -o /tmp/spark.mod /tmp/spark.te
semodule_package -o /tmp/spark.pp -m /tmp/spark.mod
sudo semodule -i /tmp/spark.pp
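You can confirm the module is loaded and check for any remaining denials (ausearch is part of the audit package, which is installed on most CentOS systems):
sudo semodule -l | grep spark
sudo ausearch -m avc -ts recent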
Troubleshooting Common Issues
Even with careful installation and configuration, you may encounter some issues. Here are solutions to common problems:
Java Version Mismatch
If you see errors related to Java version incompatibility, ensure you’re using a compatible Java version. You can switch between installed Java versions using:
sudo alternatives --config java
Port Conflicts
If Spark fails to start due to port conflicts, check whether the required ports are already in use (ss is available by default on CentOS Stream 10; netstat requires the net-tools package):
sudo ss -tulpn | grep LISTEN
If necessary, change the Spark ports in the configuration files.
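For the standalone master and workers, port assignments can be changed in spark-env.sh; the values below are only examples:
export SPARK_MASTER_PORT=7177
export SPARK_MASTER_WEBUI_PORT=8180
export SPARK_WORKER_WEBUI_PORT=8181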
Memory Issues
If your Spark jobs are failing due to out-of-memory errors, adjust the memory settings in spark-defaults.conf:
spark.driver.memory 4g
spark.executor.memory 4g
Increase these values based on your available system resources.
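You can also raise memory for a single job at submit time without editing spark-defaults.conf; replace your-application.jar with your own application:
spark-submit --driver-memory 6g --executor-memory 6g your-application.jar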
Congratulations! You have successfully installed Apache Spark. Thanks for using this tutorial to install the Apache Spark open-source framework on your CentOS Stream 10 system. For additional help or useful information, we recommend you check the official Apache Spark website.