
How To Install Apache Spark on AlmaLinux 10


Apache Spark stands as one of the most powerful unified analytics engines for large-scale data processing in today’s enterprise computing landscape. When combined with AlmaLinux 10’s enterprise-grade stability and security features, organizations gain a robust platform capable of handling massive datasets with exceptional performance. This comprehensive guide walks through every step of installing Apache Spark on AlmaLinux 10, ensuring a production-ready deployment that meets enterprise standards.

Understanding Apache Spark and AlmaLinux 10

What is Apache Spark?

Apache Spark represents a paradigm shift in big data processing, offering in-memory computation capabilities that dramatically outperform traditional disk-based systems. The framework provides unified APIs across multiple programming languages including Java, Scala, Python, and R, making it accessible to diverse development teams. Spark’s core strength lies in its ability to cache intermediate results in memory, reducing the need for expensive disk I/O operations that plague traditional MapReduce frameworks.

The platform excels in various use cases including batch processing, real-time stream processing, machine learning workloads, and interactive analytics. Organizations leverage Spark for data ETL operations, recommendation systems, fraud detection, and complex analytical queries across petabyte-scale datasets.

AlmaLinux 10 Overview

AlmaLinux 10 emerges as a premier enterprise Linux distribution, providing binary compatibility with Red Hat Enterprise Linux while maintaining complete open-source accessibility. The operating system delivers enhanced security features, improved container support, and optimized performance characteristics that make it ideal for big data workloads. Its long-term support model and enterprise-focused development approach ensure stability for mission-critical Spark deployments.

System Requirements and Prerequisites

Hardware Requirements

Successful Apache Spark deployment requires careful consideration of system resources. The minimum hardware specification is 8GB of RAM, though production environments typically require 32GB or more for optimal performance. A common guideline is to allocate no more than roughly 75% of system memory to Spark's JVM heaps, leaving adequate space for the operating system and other processes.

CPU requirements vary based on workload complexity, but modern multi-core processors with at least 8 cores provide good baseline performance. Storage considerations include fast SSD drives for temporary data spilling and adequate network bandwidth for distributed cluster communication. For development environments, 4 CPU cores and 8GB RAM suffice, while production deployments benefit from 16+ cores and 64GB+ RAM configurations.
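
Before committing to a sizing plan, it helps to confirm what the host actually provides; in the lsblk output below, a ROTA value of 0 indicates non-rotational (SSD) storage:

free -h
nproc
lsblk -d -o NAME,ROTA,SIZE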

Software Prerequisites

Java Development Kit installation forms the foundation of any Spark deployment. AlmaLinux 10 ships current OpenJDK releases, with OpenJDK 21 as the default system JDK. Spark 4.x requires Java 17 or newer, while Spark 3.5.x also runs on Java 11 and 17, so verifying compatibility between your chosen Spark release and the installed JDK is crucial before installation begins.

Additional software dependencies include curl for downloading packages, tar for archive extraction, and sudo privileges for system-level configuration changes. Optional components such as Python development libraries enable PySpark functionality, while Scala installations support native Spark development workflows.
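
These tools can be installed up front; curl and tar are usually present on a minimal installation, and python3 is only needed if you plan to use PySpark:

sudo dnf install curl wget tar python3 -y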

Pre-Installation Preparation

System Updates

Beginning with a fully updated AlmaLinux 10 system ensures compatibility and security. The DNF package manager provides comprehensive update capabilities through simple commands that refresh package repositories and install latest security patches. System administrators should verify repository configurations and ensure network connectivity to official AlmaLinux mirrors before proceeding.

Execute the following commands to update the system:

sudo dnf upgrade --refresh -y
sudo dnf install epel-release -y

These commands refresh all system packages while installing the Extra Packages for Enterprise Linux repository, providing access to additional software components that may be required during Spark installation.

Creating System Users

Security best practices mandate running Spark services under dedicated system accounts rather than root privileges. Creating a dedicated sparkuser account isolates Spark processes and limits potential security exposure in multi-tenant environments. The user account should have appropriate permissions for accessing Spark installation directories while maintaining restricted access to sensitive system resources.

User creation involves several steps:

sudo useradd -m -s /bin/bash sparkuser
sudo usermod -aG wheel sparkuser

This configuration creates a home directory for the Spark user and adds the account to the wheel group so it can perform administrative tasks with sudo.
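
Set a password for the account (required for sudo authentication through the wheel group) and confirm its group membership:

sudo passwd sparkuser
id sparkuser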

Installing Java Development Kit

Java Installation Methods

OpenJDK installation on AlmaLinux 10 leverages the DNF package manager for streamlined deployment and automatic dependency resolution. OpenJDK 21, the distribution's default JDK, satisfies Spark 4.x's requirement for Java 17 or newer and brings modern garbage collection and security improvements, making it the appropriate choice for this guide.

Install OpenJDK 21 using the following command:

sudo dnf install java-21-openjdk java-21-openjdk-devel -y

The installation includes both the runtime environment and development tools necessary for Spark operation. Verification ensures proper installation:

java -version
javac -version

Expected output should display OpenJDK version information confirming successful installation.

Java Configuration

Proper Java environment configuration requires setting JAVA_HOME variables that Spark can locate during startup. AlmaLinux 10 typically installs OpenJDK in /usr/lib/jvm/ directory structures, requiring administrators to identify the exact installation path. Environment variable configuration ensures consistent Java access across user sessions and system reboots.

Locate the Java installation directory:

sudo find /usr/lib/jvm/ -maxdepth 1 -name "java-21-openjdk*"

The java-21-openjdk-devel package also maintains an unversioned /usr/lib/jvm/java-21-openjdk symlink through the alternatives system; using it keeps JAVA_HOME valid across minor JDK updates. If the symlink is absent on your system, use the full versioned directory reported by the find command instead.

Set JAVA_HOME for the current session and add it to the system environment (note that /etc/environment takes plain KEY=VALUE pairs, without the export keyword):

export JAVA_HOME=/usr/lib/jvm/java-21-openjdk
echo 'JAVA_HOME=/usr/lib/jvm/java-21-openjdk' | sudo tee -a /etc/environment

This configuration makes Java available system-wide for new login sessions and persists across reboots.

Downloading and Installing Apache Spark

Downloading Spark

Apache Spark distribution requires manual download from official mirrors due to its absence from standard AlmaLinux repositories. The Apache Software Foundation maintains multiple download mirrors worldwide, providing reliable access to current and historical Spark releases. Selecting the appropriate Spark version involves considering Hadoop compatibility, Scala version requirements, and specific feature needs.

Navigate to a suitable directory and download Spark:

cd /opt
sudo wget https://dlcdn.apache.org/spark/spark-4.0.1/spark-4.0.1-bin-hadoop3.tgz

Verify download integrity using the checksums provided on the official download page; Apache publishes checksum files on its own download servers rather than on mirrors. If this release has since been superseded, take the download link from the Spark archive instead. This step ensures file integrity and prevents corrupted installations that could cause runtime issues.
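
A minimal check, assuming the 4.0.1 release is still published on the main Apache download servers; compare the two values manually if they are printed in different formats:

sudo wget https://downloads.apache.org/spark/spark-4.0.1/spark-4.0.1-bin-hadoop3.tgz.sha512
sha512sum spark-4.0.1-bin-hadoop3.tgz
cat spark-4.0.1-bin-hadoop3.tgz.sha512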

Installation Process

Spark installation involves extracting the downloaded archive and organizing files in appropriate system directories. Standard practice places Spark installations in /opt/spark or /usr/local/spark directories, providing system-wide access while maintaining clear separation from other software. Proper directory permissions ensure security while allowing necessary access for service operations.

Extract and install Spark:

sudo tar xvf spark-4.0.1-bin-hadoop3.tgz
sudo mv spark-4.0.1-bin-hadoop3 /opt/spark
sudo chown -R sparkuser:sparkuser /opt/spark

These commands extract the archive, move it to a standard location, and assign ownership to the dedicated Spark user account.

Configuring Environment Variables

Environment variable configuration enables system-wide Spark access and proper service operation. SPARK_HOME points to the installation directory, while PATH modifications allow command-line access to Spark utilities. Configuration files must account for both interactive user sessions and system service requirements.

Configure Spark environment variables system-wide. Because /etc/environment does not expand variables such as $PATH, use a profile script instead:

sudo nano /etc/profile.d/spark.sh

Add the following lines:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

User-specific configuration in .bashrc files:

echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.bashrc
source ~/.bashrc

This combination makes Spark available system-wide in login shells and in the dedicated user's interactive sessions; the systemd units created in the next section pass SPARK_HOME and JAVA_HOME to the services directly.
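
To confirm the variables resolve correctly, open a new shell or source the files just created and check:

source /etc/profile.d/spark.sh
echo $SPARK_HOME
spark-submit --version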

Creating Systemd Service Files

Master Service Configuration

Systemd service files enable automatic Spark startup and proper integration with AlmaLinux 10’s service management infrastructure. The master service coordinates cluster operations and provides web interfaces for monitoring and management. Service configuration must specify proper user contexts, dependency relationships, and startup parameters for reliable operation.

Create the Spark master service file:

sudo nano /etc/systemd/system/spark-master.service

Service file content:

[Unit]
Description=Apache Spark Master
After=network.target
Wants=network.target

[Service]
User=sparkuser
Group=sparkuser
Type=forking
ExecStart=/opt/spark/sbin/start-master.sh
ExecStop=/opt/spark/sbin/stop-master.sh
Restart=on-failure
RestartSec=10
Environment=SPARK_HOME=/opt/spark
Environment=JAVA_HOME=/usr/lib/jvm/java-21-openjdk

[Install]
WantedBy=multi-user.target

Worker Service Configuration

Worker services connect to the master node and execute distributed computations across cluster resources. Each worker requires configuration specifying connection parameters, resource allocation, and operational parameters. The master URL passed to the worker must match the host name the master actually advertises (visible in its log and web UI); replace localhost with that value on multi-node setups. Service files must also account for memory allocation, CPU core assignment, and network communication requirements.

Create the worker service file:

sudo vim /etc/systemd/system/spark-worker.service

Worker service configuration:

[Unit]
Description=Apache Spark Worker
After=network.target spark-master.service
Wants=network.target
Requires=spark-master.service

[Service]
User=sparkuser
Group=sparkuser
Type=forking
ExecStart=/opt/spark/sbin/start-worker.sh spark://localhost:7077
ExecStop=/opt/spark/sbin/stop-worker.sh
Restart=on-failure
RestartSec=10
Environment=SPARK_HOME=/opt/spark
Environment=JAVA_HOME=/usr/lib/jvm/java-21-openjdk

[Install]
WantedBy=multi-user.target

Enable and start services:

sudo systemctl daemon-reload
sudo systemctl enable spark-master spark-worker
sudo systemctl start spark-master spark-worker

Security Configuration

Basic Security Setup

Apache Spark security encompasses multiple layers including authentication, authorization, and encryption capabilities. Production deployments require careful attention to security configurations that protect data and prevent unauthorized access. AlmaLinux 10 provides robust security features that complement Spark’s built-in security mechanisms.

Configure basic Spark security by editing the configuration file:

sudo nano /opt/spark/conf/spark-defaults.conf

Add security configurations:

spark.authenticate true
spark.authenticate.secret your-secret-key
spark.network.crypto.enabled true
spark.io.encryption.enabled true
spark.ssl.enabled true
spark.acls.enable true

These settings enable authentication, network and I/O encryption, and access controls for secure operation. Replace your-secret-key with a strong random value, and note that spark.ssl.enabled additionally requires keystore and truststore options (spark.ssl.keyStore, spark.ssl.keyStorePassword, and related settings) before the web UIs will serve HTTPS.
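
A quick way to generate a suitably random secret and to keep the file that stores it out of reach of other local users (paths assume the configuration file created above):

openssl rand -hex 32
sudo chown sparkuser:sparkuser /opt/spark/conf/spark-defaults.conf
sudo chmod 600 /opt/spark/conf/spark-defaults.conf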

Advanced Security Measures

Firewall configuration protects Spark services while allowing necessary communication ports. AlmaLinux 10’s firewalld service provides comprehensive network security management. Default Spark installations require several ports for web interfaces, inter-node communication, and client connections.

Configure firewall rules:

sudo firewall-cmd --permanent --add-port=4040/tcp
sudo firewall-cmd --permanent --add-port=7077/tcp
sudo firewall-cmd --permanent --add-port=8080/tcp
sudo firewall-cmd --permanent --add-port=8081/tcp
sudo firewall-cmd --reload
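
Confirm the rules are active:

sudo firewall-cmd --list-ports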

SELinux adjustments are usually only required when another confined service, such as an Nginx or Apache reverse proxy, needs to reach the Spark web interfaces. The semanage utility is provided by the policycoreutils-python-utils package:

sudo dnf install policycoreutils-python-utils -y
sudo setsebool -P httpd_can_network_connect 1
sudo semanage port -a -t http_port_t -p tcp 4040

If a port is already assigned to another SELinux type (8080, for example, is commonly labeled http_cache_port_t), use semanage port -m rather than -a to change its assignment.

Starting and Managing Spark Services

Service Management

Systemctl commands provide comprehensive service management capabilities for Spark deployments. Administrators can start, stop, restart, and monitor service status through standard Linux service management interfaces. Service logs provide valuable troubleshooting information when issues arise.

Basic service management commands:

# Start services
sudo systemctl start spark-master
sudo systemctl start spark-worker

# Check service status
sudo systemctl status spark-master
sudo systemctl status spark-worker

# View service logs
sudo journalctl -u spark-master -f
sudo journalctl -u spark-worker -f

Cluster Configuration

Standalone cluster mode provides simple cluster management without external resource managers such as YARN or Kubernetes. Configuration involves specifying master URLs, worker resources, and application deployment parameters. Multi-node setups require network configuration and coordinated service startup across cluster members.

Configure cluster settings in spark-defaults.conf:

spark.master spark://your-master-hostname:7077
spark.executor.memory 4g
spark.executor.cores 2
spark.driver.memory 2g
spark.serializer org.apache.spark.serializer.KryoSerializer
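
To confirm the cluster accepts work, the SparkPi example bundled with the distribution can be submitted against the master URL configured above (replace the hostname with your own):

spark-submit \
  --master spark://your-master-hostname:7077 \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 100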

Verification and Testing

Installation Verification

Comprehensive installation testing ensures all components function correctly before production deployment. Spark provides multiple interfaces for testing including command-line shells, web interfaces, and programmatic APIs. Testing should verify basic functionality, resource allocation, and inter-service communication.

Test Spark shell functionality:

spark-shell --conf spark.driver.bindAddress=localhost

Expected output includes Spark version information and an interactive Scala prompt. The web interface should be accessible at http://localhost:8080 showing cluster status and resource utilization.

Performance Testing

Basic performance validation ensures proper resource allocation and system optimization. Simple benchmark applications can verify memory allocation, CPU utilization, and network communication patterns. Performance testing identifies potential bottlenecks before production workloads begin.

Run a simple performance test:

scala> val data = sc.parallelize(1 to 100000)
scala> val result = data.map(x => x.toLong * x).reduce(_ + _)
scala> println(s"Sum of squares: $result")

Converting each value to Long before squaring avoids integer overflow, since both the squares and their sum exceed the range of a 32-bit Int.

Monitor resource utilization during test execution using system monitoring tools and Spark’s web interface.

Performance Optimization

Memory Configuration

Memory optimization significantly impacts Spark application performance and resource utilization efficiency. Proper configuration balances JVM heap allocation with system requirements while avoiding memory pressure that degrades performance. AlmaLinux 10’s memory management capabilities complement Spark’s memory allocation strategies.

Spark's standalone daemons read these settings from $SPARK_HOME/conf/spark-env.sh. Create the file from the bundled template and add the memory variables:

sudo -u sparkuser cp /opt/spark/conf/spark-env.sh.template /opt/spark/conf/spark-env.sh
sudo -u sparkuser nano /opt/spark/conf/spark-env.sh

export SPARK_WORKER_MEMORY=6g
export SPARK_DAEMON_MEMORY=1g
export SPARK_EXECUTOR_MEMORY=4g
export SPARK_DRIVER_MEMORY=2g

JVM optimization parameters, added to spark-defaults.conf:

spark.driver.extraJavaOptions -XX:+UseG1GC -XX:MaxGCPauseMillis=200
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxGCPauseMillis=200

System Tuning

Operating system tuning complements Spark optimization for maximum performance. AlmaLinux 10 provides numerous tuning opportunities including kernel parameters, I/O scheduling, and network optimization. System-level optimizations often provide significant performance improvements for data-intensive workloads.

Network optimization:

echo 'net.core.rmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.core.wmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.ipv4.tcp_rmem = 4096 87380 134217728' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
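
Shuffle-heavy workloads can also exhaust the default open-file limit. A common adjustment, assuming 65536 descriptors is adequate for your workloads, raises the limit for the Spark user; for the systemd-managed services, add LimitNOFILE=65536 to the [Service] section of the unit files as well:

echo 'sparkuser soft nofile 65536' | sudo tee -a /etc/security/limits.d/spark.conf
echo 'sparkuser hard nofile 65536' | sudo tee -a /etc/security/limits.d/spark.conf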

Troubleshooting Common Issues

Installation and configuration issues commonly arise during Spark deployment on AlmaLinux 10. Java version incompatibilities represent frequent problems, particularly when multiple JDK versions exist on the system. Permission issues affect service startup and file access, requiring careful attention to user accounts and directory ownership.

Common troubleshooting steps include:

  • Verifying Java installation and JAVA_HOME configuration
  • Checking file permissions and ownership in Spark directories
  • Confirming network connectivity and firewall rules
  • Reviewing service logs for error messages and stack traces
  • Testing port availability and service binding addresses

Memory allocation errors often indicate insufficient system resources or improper JVM configuration. Port conflicts arise when other services occupy Spark’s default ports, requiring either service reconfiguration or port remapping.
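
A few quick checks help narrow these problems down; jps ships with the OpenJDK devel package and lists the running JVMs of the invoking user:

sudo ss -tlnp | grep -E '7077|8080|8081|4040'
sudo journalctl -u spark-master --since "1 hour ago" --no-pager | tail -n 50
sudo -u sparkuser jps -l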

Best Practices and Maintenance

Operational Best Practices

Production Spark deployments require ongoing maintenance including security updates, performance monitoring, and capacity planning. AlmaLinux 10’s update mechanisms provide automated security patching while maintaining system stability. Regular backup procedures protect configuration and application code from data loss.

Establish monitoring procedures:

# Monitor service health
sudo systemctl status spark-master spark-worker

# Check resource utilization
htop
iotop
nethogs

Log rotation prevents disk space exhaustion:

sudo vim /etc/logrotate.d/spark

Configure log rotation parameters for Spark log files in /opt/spark/logs/.
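
A minimal sketch of such a configuration, assuming the default daemon log location of /opt/spark/logs and using copytruncate because the daemons keep their log files open:

/opt/spark/logs/*.out {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    copytruncate
}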

Monitoring and Maintenance

Comprehensive monitoring encompasses system resources, application performance, and service availability. Spark provides built-in metrics and web interfaces for performance monitoring. Integration with external monitoring systems enables alerting and historical trend analysis.

Configure basic monitoring. The Spark history server requires event logging to be enabled (spark.eventLog.enabled and spark.eventLog.dir in spark-defaults.conf) and either its own systemd unit, modeled on the master and worker units above, or a manual start with the bundled script:

# Start the Spark history server
sudo -u sparkuser /opt/spark/sbin/start-history-server.sh

Regular maintenance tasks include:

  • Reviewing and rotating log files
  • Monitoring disk space utilization
  • Updating Spark and Java versions
  • Performance tuning based on workload patterns
  • Security patch application and vulnerability assessment

Integration with Hadoop Ecosystem

Apache Spark integrates seamlessly with Hadoop ecosystem components including HDFS for distributed storage and YARN for resource management. AlmaLinux 10 supports complete Hadoop ecosystem deployments, enabling comprehensive big data processing pipelines. Integration configuration requires attention to compatibility versions and network connectivity between services.

HDFS integration enables distributed data storage:

spark.hadoop.fs.defaultFS hdfs://namenode:9000
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode:9000/spark-logs

External data source connectivity includes databases, message queues, and cloud storage systems. Spark’s extensive connector ecosystem supports most enterprise data sources through standardized APIs and configuration parameters.

Congratulations! You have successfully installed Apache Spark. Thanks for using this tutorial to install the Apache Spark open-source unified analytics engine on your AlmaLinux 10 system. For additional help or useful information, we recommend you check the official Apache Spark website.
