
How To Install Apache Spark on Fedora 42


Apache Spark has revolutionized big data processing with its lightning-fast in-memory computing capabilities and unified analytics engine. Installing Apache Spark on Fedora 42 provides developers and data scientists with a robust platform for distributed data processing, machine learning, and real-time analytics. This comprehensive guide will walk you through every step of the installation process, ensuring a successful Spark deployment on your Fedora 42 system.

Understanding Apache Spark and Its Ecosystem

Apache Spark stands as one of the most powerful open-source distributed computing frameworks available today. It provides unified analytics for large-scale data processing, supporting multiple programming languages including Scala, Java, Python, and R. Spark’s core strength lies in its ability to perform in-memory computations, making it significantly faster than traditional disk-based processing frameworks like Hadoop MapReduce.

The Spark ecosystem consists of several integrated components that work seamlessly together. Spark Core serves as the foundation, providing basic I/O functionalities, task scheduling, and memory management. Spark SQL enables structured data processing using SQL queries and DataFrames. Spark Streaming handles real-time data processing, while MLlib provides machine learning algorithms and utilities. Finally, GraphX offers graph processing capabilities for social network analysis and recommendation systems.
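
As a brief illustration of how these components are reached from application code, the following PySpark sketch touches Spark SQL through a DataFrame and a SQL query, and MLlib through a feature transformer. It assumes PySpark is already available, which the installation steps below will set up, so treat it as a preview rather than something to run right now.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler  # MLlib (DataFrame-based API)

# A single SparkSession is the entry point to Spark Core, Spark SQL, and MLlib.
spark = SparkSession.builder.appName("EcosystemPreview").getOrCreate()

# Spark SQL: structured data as a DataFrame, queryable with SQL.
df = spark.createDataFrame([(1, 2.0), (2, 3.5), (3, 5.0)], ["id", "value"])
df.createOrReplaceTempView("measurements")
spark.sql("SELECT AVG(value) AS avg_value FROM measurements").show()

# MLlib: build feature vectors ready for a machine-learning algorithm.
VectorAssembler(inputCols=["value"], outputCol="features").transform(df).show()

spark.stop()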

Fedora 42 represents an excellent choice for Apache Spark deployment due to its cutting-edge kernel, advanced package management through DNF, and excellent hardware support. The distribution’s focus on innovation and stability makes it ideal for data processing workloads that require both performance and reliability.

System Requirements and Prerequisites

Before proceeding with the Apache Spark installation on Fedora 42, ensure your system meets the necessary hardware and software requirements. Minimum hardware specifications include a multi-core processor with at least 4 CPU cores, though 8 or more cores are recommended for optimal performance. Memory requirements start at 8GB RAM minimum, but 16GB or more is strongly recommended for production environments and large dataset processing.

Storage requirements include at least 10GB of free disk space for the base installation, though you should allocate significantly more space if you plan to process large datasets. A solid-state drive (SSD) is highly recommended for improved I/O performance during data processing operations. Network connectivity should be stable and fast, especially if you plan to set up a distributed Spark cluster or access remote data sources.

Software prerequisites begin with a fresh Fedora 42 installation with all available system updates applied. You’ll need administrative privileges (sudo access) to install packages and configure system services. A Java Development Kit (JDK) version 17 or later is absolutely essential, as Spark runs on the Java Virtual Machine (JVM) and Spark 4.x no longer supports earlier Java releases. Python 3.9 or later is required if you plan to use PySpark for Python-based Spark applications.

Security considerations include configuring firewall rules if you plan to access Spark’s web interfaces remotely. The default Spark web UI runs on port 4040, while the standalone cluster manager uses ports 7077 (master) and 8080 (web UI). Ensure these ports are accessible according to your security requirements and network topology.

Preparing the Fedora 42 Environment

System preparation begins with updating all installed packages to their latest versions. Execute the following commands to ensure your Fedora 42 system is current:

sudo dnf update -y
sudo dnf install -y curl wget tar which

This process updates the package database and installs essential utilities required during the Spark installation process. The update operation may take several minutes depending on your internet connection and the number of packages requiring updates.

Installing the Java Development Kit (JDK) is the most critical prerequisite for Apache Spark. Fedora 42 provides OpenJDK packages through its default repositories. Install OpenJDK 17, which offers excellent performance and long-term support:

sudo dnf install -y java-17-openjdk java-17-openjdk-devel

Verify the Java installation by checking the version:

java -version
javac -version

Configure the JAVA_HOME environment variable by adding the following line to your ~/.bashrc file:

echo 'export JAVA_HOME=/usr/lib/jvm/java-17-openjdk' >> ~/.bashrc
source ~/.bashrc

Installing optional dependencies enhances your Spark development experience. For Python users planning to use PySpark, ensure Python 3 and pip are available:

sudo dnf install -y python3 python3-pip python3-devel
pip3 install --user numpy pandas matplotlib jupyter
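
To confirm that these packages were installed for the interpreter PySpark will later use, a quick check like the following can be run with python3:

# Sanity check: the scientific Python stack is importable and prints versions.
import numpy
import pandas
import matplotlib

print("numpy:", numpy.__version__)
print("pandas:", pandas.__version__)
print("matplotlib:", matplotlib.__version__)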

Development tools and libraries that support Spark development include:

sudo dnf group install -y "Development Tools"
sudo dnf install -y git vim nano htop

Downloading and Installing Apache Spark

Navigate to the /opt directory, which is the conventional location for optional software installations on Linux systems:

cd /opt

Download the latest stable Apache Spark release from the official Apache Software Foundation mirror. As of this writing, Apache Spark 4.0.1 is the current stable release, distributed with prebuilt Hadoop 3 client libraries:

sudo wget https://downloads.apache.org/spark/spark-4.0.1/spark-4.0.1-bin-hadoop3.tgz

Verify the download integrity by checking the file’s checksum against the official SHA-512 hash provided on the Apache Spark downloads page:

sudo wget https://downloads.apache.org/spark/spark-4.0.1/spark-4.0.1-bin-hadoop3.tgz.sha512
sha512sum -c spark-4.0.1-bin-hadoop3.tgz.sha512

The verification should return “OK” confirming the download’s integrity. If the verification fails, re-download the package before proceeding.

Extract Apache Spark by decompressing the downloaded archive, renaming the extracted directory to a predictable path, and setting ownership and permissions:

sudo tar -xzf spark-4.0.1-bin-hadoop3.tgz
sudo mv spark-4.0.1-bin-hadoop3 spark
sudo chown -R root:root /opt/spark
sudo chmod -R 755 /opt/spark

Create a symbolic link for easier version management:

sudo ln -sf /opt/spark /opt/spark-current

Clean up the installation files to free disk space:

sudo rm spark-4.0.1-bin-hadoop3.tgz spark-4.0.1-bin-hadoop3.tgz.sha512

Set appropriate permissions ensuring the Spark installation is accessible by all users while maintaining security:

sudo find /opt/spark -type d -exec chmod 755 {} \;
sudo find /opt/spark -type f -exec chmod 644 {} \;
sudo chmod +x /opt/spark/bin/*
sudo chmod +x /opt/spark/sbin/*

Configuration and Environment Setup

Configure environment variables by creating a comprehensive Spark environment configuration. Add the following variables to your ~/.bashrc file:

cat >> ~/.bashrc << 'EOF'
# Apache Spark Environment Configuration
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
export SPARK_LOCAL_IP=127.0.0.1

# Spark Performance Configuration
export SPARK_WORKER_MEMORY=4g
export SPARK_WORKER_CORES=2
export SPARK_DAEMON_MEMORY=1g
EOF

Apply the environment changes immediately:

source ~/.bashrc
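
Before launching any Spark process, it is worth confirming that the new environment is what Spark will actually see. A small check using only the Python standard library:

import os

# Empty values mean ~/.bashrc has not been sourced in this shell yet.
for name in ("JAVA_HOME", "SPARK_HOME", "PYSPARK_PYTHON", "SPARK_LOCAL_IP"):
    print(f"{name} = {os.environ.get(name, '(not set)')}")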

Configure Spark-specific settings by creating customized configuration files from the provided templates:

cd $SPARK_HOME/conf
sudo cp spark-env.sh.template spark-env.sh
sudo cp spark-defaults.conf.template spark-defaults.conf
sudo cp log4j2.properties.template log4j2.properties

Edit the spark-env.sh configuration file to set the host, memory, and JVM options for this machine:

sudo nano spark-env.sh

Add the following configurations to the file:

#!/usr/bin/env bash
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk
export SPARK_MASTER_HOST=localhost
export SPARK_WORKER_MEMORY=4g
export SPARK_WORKER_CORES=2
export SPARK_DRIVER_MEMORY=2g
export SPARK_EXECUTOR_MEMORY=2g
export SPARK_LOCAL_DIRS=/tmp/spark

Configure spark-defaults.conf with sensible defaults for a single-node deployment:

sudo nano spark-defaults.conf

Add these performance-optimized settings:

spark.master                     spark://localhost:7077
spark.eventLog.enabled           true
spark.eventLog.dir               /tmp/spark-events
spark.sql.warehouse.dir          /tmp/spark-warehouse
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              2g
spark.executor.memory            2g
spark.executor.cores             2
spark.sql.adaptive.enabled       true
spark.sql.adaptive.coalescePartitions.enabled true

Create the necessary directories for Spark temporary files and logs:

sudo mkdir -p /tmp/spark-events /tmp/spark-warehouse /tmp/spark
sudo chmod 777 /tmp/spark-events /tmp/spark-warehouse /tmp/spark
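
Once a Spark application starts, you can confirm which of these defaults it actually picked up. The following is a minimal PySpark sketch; it overrides the master with local[*] so it runs even before the standalone services from the next section are in place:

from pyspark.sql import SparkSession

# spark-defaults.conf is read automatically from $SPARK_HOME/conf;
# local[*] overrides the master so no cluster needs to be running yet.
spark = SparkSession.builder.master("local[*]").appName("ConfCheck").getOrCreate()

for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(key, "=", value)

print("Serializer in use:", spark.conf.get("spark.serializer"))
spark.stop()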

Creating Systemd Services for Automated Management

Creating systemd services enables automatic startup and professional management of Spark services. Create a Spark master service by generating a systemd unit file:

sudo nano /etc/systemd/system/spark-master.service

Configure the master service with the following content:

[Unit]
Description=Apache Spark Master
After=network.target
Wants=network-online.target

[Service]
Type=forking
User=root
Group=root
WorkingDirectory=/opt/spark
ExecStart=/opt/spark/sbin/start-master.sh
ExecStop=/opt/spark/sbin/stop-master.sh
Restart=on-failure
RestartSec=10
Environment=JAVA_HOME=/usr/lib/jvm/java-17-openjdk
Environment=SPARK_HOME=/opt/spark

[Install]
WantedBy=multi-user.target

Create a Spark worker service for standalone cluster deployments:

sudo nano /etc/systemd/system/spark-worker.service

Configure the worker service:

[Unit]
Description=Apache Spark Worker
After=network.target spark-master.service
Wants=network-online.target
Requires=spark-master.service

[Service]
Type=forking
User=root
Group=root
WorkingDirectory=/opt/spark
ExecStart=/opt/spark/sbin/start-worker.sh spark://localhost:7077
ExecStop=/opt/spark/sbin/stop-worker.sh
Restart=on-failure
RestartSec=10
Environment=JAVA_HOME=/usr/lib/jvm/java-17-openjdk
Environment=SPARK_HOME=/opt/spark

[Install]
WantedBy=multi-user.target

Enable and manage the Spark services using systemctl commands:

sudo systemctl daemon-reload
sudo systemctl enable spark-master
sudo systemctl enable spark-worker
sudo systemctl start spark-master
sudo systemctl start spark-worker

Verify service status and troubleshoot any issues:

sudo systemctl status spark-master
sudo systemctl status spark-worker
sudo journalctl -u spark-master -f
sudo journalctl -u spark-worker -f
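
Beyond systemctl, the standalone master has historically exposed a JSON summary of cluster state at /json on its web UI port; treat the endpoint as an assumption and fall back to the web UI if it is not available. A quick check using only the Python standard library:

import json
import urllib.request

# Query the master's status page in JSON form (assumed /json endpoint).
with urllib.request.urlopen("http://localhost:8080/json", timeout=5) as response:
    status = json.load(response)

print("Master status:", status.get("status"))
print("Registered workers:", len(status.get("workers", [])))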

Configure firewall rules to allow access to Spark’s web interfaces and cluster communication:

sudo firewall-cmd --permanent --add-port=7077/tcp
sudo firewall-cmd --permanent --add-port=8080/tcp
sudo firewall-cmd --permanent --add-port=4040/tcp
sudo firewall-cmd --reload

Testing and Verification Procedures

Verify the basic Spark installation by launching the Spark shell and performing simple operations:

spark-shell

Within the Spark shell, execute the following Scala commands to test functionality:

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
val sum = distData.reduce(_ + _)
println(s"Sum: $sum")
:quit

Test PySpark functionality to ensure Python integration works correctly:

pyspark

Execute these Python commands in the PySpark shell:

import pyspark
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("TestApp").getOrCreate()

# Create a simple DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Show the DataFrame
df.show()

# Perform a simple aggregation
df.groupBy().avg("Age").show()

# Stop the session
spark.stop()
exit()

Run built-in example applications to verify cluster functionality; run them from the $SPARK_HOME directory, since some examples load bundled sample data from relative paths:

run-example SparkPi 10
run-example org.apache.spark.examples.sql.SparkSQLExample
run-example org.apache.spark.examples.ml.LinearRegressionExample

Access the Spark Web UI to monitor cluster status and application execution. Open your web browser and navigate to http://localhost:8080 for the cluster overview and http://localhost:4040 for running application details.

The web interface provides valuable information including:

  • Active applications and their resource usage
  • Completed applications and execution history
  • Worker node status and available resources
  • Job and stage execution details
  • SQL query execution plans and performance metrics
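
The same information is also exposed programmatically through Spark's monitoring REST API, served by the driver under /api/v1 on the application UI port (4040 by default). A brief sketch, assuming an application is currently running:

import json
import urllib.request

# List applications registered with the driver UI on port 4040.
with urllib.request.urlopen("http://localhost:4040/api/v1/applications", timeout=5) as response:
    applications = json.load(response)

for app in applications:
    print(app["id"], "-", app["name"])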

Performance benchmarking helps establish baseline performance characteristics:

spark-submit --class org.apache.spark.examples.SparkPi \
  --master spark://localhost:7077 \
  --executor-memory 2g \
  --total-executor-cores 4 \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 1000

Monitor system resources during execution using the tools below (iostat is part of the sysstat package, which may need to be installed with dnf first):

htop
iostat -x 1
free -h

Advanced Configuration and Optimization

Security configuration becomes essential for production environments. Enable Spark authentication by adding these settings to spark-defaults.conf:

spark.authenticate                    true
spark.authenticate.secret             your-secret-key-here
spark.network.crypto.enabled          true
spark.io.encryption.enabled           true
spark.ssl.enabled                     true

Generate a self-signed certificate and key for encrypted communication (Spark's spark.ssl.* settings ultimately reference a Java keystore, so in practice these files are imported into one, for example with keytool):

sudo mkdir -p /opt/spark/ssl
cd /opt/spark/ssl
sudo openssl req -x509 -newkey rsa:4096 -keyout spark-keystore.key -out spark-keystore.crt -days 365 -nodes
sudo chown -R root:root /opt/spark/ssl
sudo chmod 600 /opt/spark/ssl/spark-keystore.key

Memory optimization requires careful tuning based on your workload characteristics and available system resources. Key parameters include:

spark.driver.memory                   4g
spark.driver.maxResultSize            2g
spark.executor.memory                 4g
spark.memory.fraction                 0.8
spark.executor.instances              2
spark.sql.adaptive.enabled            true
spark.sql.adaptive.coalescePartitions.enabled    true
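
The same properties can also be supplied per application rather than globally; the sketch below shows how the SparkSession builder maps onto entries like the ones above (the values are illustrative):

from pyspark.sql import SparkSession

# Per-application settings take precedence over spark-defaults.conf.
# Note: spark.driver.memory only takes effect if set before the driver JVM
# starts, so it cannot be changed from inside an already-running shell.
spark = (
    SparkSession.builder
    .appName("TunedApp")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.memoryOverhead", "512m")
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

print("AQE enabled:", spark.conf.get("spark.sql.adaptive.enabled"))
spark.stop()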

Integration with external storage systems extends Spark’s capabilities for enterprise environments. Configure AWS S3 access via the s3a connector (the hadoop-aws JAR and a matching AWS SDK bundle must also be added to Spark’s classpath, as they are not included in the default distribution):

spark.hadoop.fs.s3a.access.key        your-access-key
spark.hadoop.fs.s3a.secret.key        your-secret-key
spark.hadoop.fs.s3a.endpoint          s3.amazonaws.com
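
With those credentials in place, and assuming the hadoop-aws and AWS SDK bundle JARs mentioned above are on Spark's classpath, reading from a bucket looks like the following sketch; the bucket name and path are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("S3Read").getOrCreate()

# s3a:// paths are resolved by the hadoop-aws connector configured above;
# the bucket and key below are placeholders.
df = spark.read.csv("s3a://your-bucket/path/to/data.csv", header=True, inferSchema=True)
df.printSchema()
spark.stop()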

HDFS integration for Hadoop ecosystem compatibility does not require additional Fedora packages, since Fedora no longer ships Hadoop and the Spark binary distribution already bundles the Hadoop client libraries. Copy core-site.xml and hdfs-site.xml from your Hadoop cluster into a local directory and point Spark at it:

export HADOOP_CONF_DIR=/etc/hadoop/conf

Database connectivity through JDBC requires appropriate driver JARs:

cd $SPARK_HOME/jars
sudo wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.33/mysql-connector-java-8.0.33.jar
sudo wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.6.0/postgresql-42.6.0.jar
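
Once the driver JARs are in place, a table can be read over JDBC as a DataFrame. A minimal sketch follows; the host, database, table, and credentials are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JdbcRead").getOrCreate()

# The PostgreSQL driver downloaded into $SPARK_HOME/jars is loaded automatically;
# connection details below are placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")
    .option("dbtable", "public.my_table")
    .option("user", "myuser")
    .option("password", "mypassword")
    .option("driver", "org.postgresql.Driver")
    .load()
)
df.show(5)
spark.stop()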

Troubleshooting Common Installation Issues

Java-related problems represent the most frequent installation issues. If you encounter “Java gateway process exited” errors, verify your JAVA_HOME configuration:

echo $JAVA_HOME
ls -la $JAVA_HOME/bin/java

Ensure the Java version meets Spark’s requirements:

java -version 2>&1 | head -1

Memory allocation errors often manifest as “Java heap space” or “GC overhead limit exceeded” messages. Increase driver and executor memory allocations:

export SPARK_DRIVER_MEMORY=4g
export SPARK_EXECUTOR_MEMORY=4g

For persistent memory issues, modify the spark-defaults.conf file with appropriate memory settings based on your system’s available RAM.

Port conflict resolution becomes necessary when default Spark ports are already in use. Check for port conflicts with ss (installed by default on Fedora; netstat requires the net-tools package):

sudo ss -tulpn | grep :8080
sudo ss -tulpn | grep :7077

Modify port assignments in spark-env.sh:

export SPARK_MASTER_PORT=7078
export SPARK_MASTER_WEBUI_PORT=8081

Permission-related issues can prevent Spark from accessing necessary directories. Ensure proper ownership and permissions:

sudo chown -R $USER:$USER /tmp/spark*
chmod 755 /tmp/spark*

Network connectivity problems in cluster deployments require careful firewall and hostname configuration. Verify hostname resolution:

hostname -f
ping $(hostname -f)

Configure /etc/hosts if necessary:

echo "127.0.0.1 $(hostname -f)" | sudo tee -a /etc/hosts

Log analysis provides detailed debugging information. Monitor Spark logs in real-time:

tail -f $SPARK_HOME/logs/spark-*-master-*.out
tail -f $SPARK_HOME/logs/spark-*-worker-*.out

Enable verbose logging by modifying log4j2.properties:

sudo sed -i 's/rootLogger.level = info/rootLogger.level = debug/g' $SPARK_HOME/conf/log4j2.properties

Performance Optimization and Best Practices

Resource allocation strategies significantly impact Spark application performance. Configure executor resources based on your cluster’s characteristics:

spark.executor.instances             4
spark.executor.cores                 2
spark.executor.memory                4g
spark.executor.memoryOverhead        512m
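
A quick back-of-the-envelope check helps confirm that these values fit on a single worker; the sketch below only performs arithmetic on the settings above, and the assumed worker capacity should be adjusted to your hardware:

# Rough memory budget for the executor settings above.
executor_instances = 4
executor_memory_gb = 4.0
memory_overhead_gb = 0.5      # spark.executor.memoryOverhead (512m)
worker_capacity_gb = 16.0     # assumed RAM available to Spark on this host

required_gb = executor_instances * (executor_memory_gb + memory_overhead_gb)
print(f"Required: {required_gb:.1f} GB of {worker_capacity_gb:.1f} GB available")
if required_gb > worker_capacity_gb:
    print("Reduce executor instances or memory to avoid oversubscribing the worker.")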

Storage optimization improves I/O performance through proper disk configuration. Use local SSDs for Spark’s temporary directories:

sudo mkdir -p /mnt/ssd/spark-tmp
sudo chown $USER:$USER /mnt/ssd/spark-tmp
export SPARK_LOCAL_DIRS=/mnt/ssd/spark-tmp

Networking optimization reduces cluster communication overhead. Configure high-bandwidth network interfaces and enable network compression:

spark.rdd.compress                   true
spark.shuffle.compress               true
spark.shuffle.spill.compress         true

Monitoring and maintenance procedures ensure long-term system reliability. Set up log rotation:

sudo nano /etc/logrotate.d/spark

Configure log rotation:

/opt/spark/logs/*.out {
    daily
    missingok
    rotate 7
    compress
    notifempty
    create 644 root root
}

Regular maintenance tasks include updating Spark versions, monitoring disk usage, and reviewing performance metrics. Create a maintenance script, for example saved as /home/$USER/spark-maintenance.sh and made executable with chmod +x:

#!/bin/bash
# Spark Maintenance Script
echo "Cleaning Spark temporary files..."
rm -rf /tmp/spark/* /tmp/spark-events/* /tmp/spark-warehouse/*
echo "Checking disk usage..."
df -h /opt/spark
echo "Updating system packages..."
sudo dnf update -y

Schedule the script through cron, appending to any existing crontab rather than overwriting it:

(crontab -l 2>/dev/null; echo "0 2 * * 0 /home/$USER/spark-maintenance.sh") | crontab -

Congratulations! You have successfully installed Apache Spark. Thanks for using this tutorial to install Apache Spark on your Fedora 42 Linux system. For additional help or useful information, we recommend you check the official Apache Spark website.
