The HA-WEBTRACK project applies RHCSA and RHCE principles to create a high-availability web server environment using open-source tools. Automated with Ansible and secured by SELinux, the project streamlines infrastructure setup, logging, and alert rule implementation. We validate the environment by running high-load tests and analyzing failover scenarios, giving clear insight into system performance. By converting Ansible playbooks into roles, the project improves installation efficiency and code reusability, providing practical experience in system monitoring and metrics analysis.
- VirtualBox: Manages a secure virtual environment for all server roles including the control node, HAProxy, and web servers.
- Red Hat Enterprise Linux (RHEL) VMs: Provides a stable and secure operating base for all nodes.
- Ansible: Automates configuration and management of the server infrastructure.
- HAProxy: Balances load across web servers to enhance service reliability.
- Apache HTTPD: Serves web content efficiently on the web servers.
- Prometheus and Grafana: Monitor system performance with real-time metrics visualization.
- Loki and Promtail: Handle log aggregation and shipping, ensuring detailed logging.
- Node Exporter: Gathers comprehensive system metrics for monitoring.
- Alertmanager: Integrates with Slack to send real-time alerts, enhancing incident response capabilities.
- GitHub: Hosts project code for version control and collaborative development.
This section outlines the specific versions of the tools and technologies deployed, ensuring compatibility and stability across all components:
Tool | Version | Tool | Version |
---|---|---|---|
VirtualBox | 7.0.14 | Grafana-Enterprise | 11.2 |
RHEL VMs | 9.4 | Loki | 3.2 |
Ansible | 2.14 | Promtail | 3.2 |
HAProxy | 2.4 | Node Exporter | 1.8 |
Apache HTTPD | 2.4 | Alertmanager | 0.27 |
Prometheus | 2.54 | GitHub | Latest |
Before we begin, ensure the following are prepared:
- Four RHEL 9 VMs: These will act as our control node, load balancer (HAProxy), and two web servers.
Note: Throughout this project, the root and user password on all nodes is 'password'.
- Network Configuration: Set IP addresses and hostnames for each VM using tools like `nmtui` to ensure proper networking. Ensure the networking mode is set to Bridged Adapter so the VMs can communicate on the network directly as independent devices.
Below is a table outlining the specifications for each server used in the project:
Server | Role | CPU | RAM | Additional Notes |
---|---|---|---|---|
Control Node | Management | 2 | 4 GB | Second disk provisioned (min 20 GB) |
Node1 (HAProxy) | Load Balancer | 2 | 4 GB | |
Node2 (WebServer) | Web Server | 1 | 1 GB | |
Node3 (WebServer) | Web Server | 1 | 1 GB | |
Note: Ensure each server meets or exceeds the specifications listed above for optimal performance and reliability of the HA-WEBTRACK environment.
- Attach the RHEL ISO to the control node
- Run the command to mount the ISO:
sudo mount /dev/sr0 /mnt
- Add and configure the repository from the ISO:
dnf config-manager --add-repo=file:///mnt/AppStream
dnf config-manager --add-repo=file:///mnt/BaseOS
echo "gpgcheck=0" >> /etc/yum.repos.d/mnt_AppStream.repo
echo "gpgcheck=0" >> /etc/yum.repos.d/mnt_BaseOS.repo
- Install `git` and `ansible-core`:
dnf install -y git ansible-core
- Create the `ansible` user on the control node and set a password:
useradd ansible
passwd ansible  # Follow prompts to set password
- Add the `ansible` user to the `sudoers` configuration to grant the necessary privileges, then switch to the `ansible` user:
echo 'ansible ALL=(ALL) NOPASSWD:ALL' | sudo tee /etc/sudoers.d/ansible
su - ansible
- Set up an SSH key pair:
ssh-keygen # Press enter 3x to accept the default file location and no passphrase
To install and set up the project, follow these steps:
- Clone the repository:
git clone https://github.com/Thuynh808/HA-WebTrack
cd HA-WebTrack
- Install required Ansible collections:
ansible-galaxy collection install -r requirements.yaml
- Confirm the RHEL ISO is mounted:
sudo mount /dev/sr0 /mnt
- Configure the `ansible_host` entries in the inventory:
vim inventory
Note: Replace the IP addresses for all four servers, and the control node's FQDN and hostname, according to your setup; a sample layout is sketched below.
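For orientation, here is a minimal sketch of what such an inventory might look like. The group names mirror those referenced later in this guide (`balancers`, `webservers`); the actual variable names and entries in the repository's inventory may differ, so treat this purely as an illustration:

```ini
# Hypothetical inventory sketch -- adjust hostnames and IPs to your environment
[control]
controlnode ansible_host=192.168.1.10

[balancers]
node1 ansible_host=192.168.1.11

[webservers]
node2 ansible_host=192.168.1.12
node3 ansible_host=192.168.1.13

[all:vars]
ansible_user=ansible
```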
- Run the initial setup script:
./initial-setup.sh
This script prepares our Ansible environment; a quick connectivity check is shown after the list:
- Configures the /etc/hosts file on all nodes
- Sets up an FTP server on the control node to act as a repository
- Adds the repo to all nodes
- Ensures Python is installed on all nodes
- Creates the ansible user with password: password
- Grants the ansible user sudo permissions
- Copies the ansible user's public key to all nodes
- Uses rhel-system-roles-timesync to synchronize time across all nodes
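Once the script completes, a quick ad-hoc ping from the control node is a handy way to confirm that the ansible user, SSH keys, and Python are all in place on every node:

```bash
# Verify that Ansible can reach every node defined in the inventory
ansible all -i inventory -m ping
```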
Note: Before installing the components, add your Slack webhook URL so Alertmanager can send alerts
- Edit the Alertmanager config file (a sample Slack receiver block is sketched below):
vim roles/alertmanager/templates/alertmanager_config.j2
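The template's exact contents are defined by the role, but the Slack integration generally comes down to a receiver block along these lines. The route, receiver name, and channel here are illustrative placeholders rather than the repo's actual values; the webhook URL is the part you supply:

```yaml
# Illustrative Alertmanager snippet -- replace the webhook URL and channel with your own
route:
  receiver: slack-notifications
receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#alerts"
        send_resolved: true
```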
- Execute the main Ansible playbook:
ansible-playbook site.yaml -vv
This command starts the installation and configuration of ALL components:
- Apache HTTPD on the `webservers` group
- HAProxy load balancer on the `balancers` group
- Grafana on the `control` node
- Node Exporter on the `balancers` and `webservers` groups
- Prometheus on the `control` node
- Promtail on the `balancers` and `webservers` groups
- LVM storage for Loki logs on the `control` node
- Loki on the `control` node
- Alertmanager on the `control` node
Note: You can also run the individual playbooks for each component if you encounter any timeout errors.
- Run individual playbooks:
ansible-playbook playbooks/<playbook_name>.yaml
- Results from running the `site.yaml` playbook show no errors. AWESOME!!!
After installation, verify that all components are running correctly by accessing the following URLs and ensuring that each service is operational:
Server | Service Name | URL |
---|---|---|
Control Node | Grafana | <controlnode_ip>:3000 |
Control Node | Prometheus | <controlnode_ip>:9090 |
Control Node | Loki (through Grafana) | <controlnode_ip>:3000 |
Control Node | Alertmanager | <controlnode_ip>:9093 |
HAProxy (node1) | HAProxy | <node1_ip>:80 |
HAProxy (node1) | Node Exporter | <node1_ip>:9100 |
HAProxy (node1) | HAProxy Exporter | <node1_ip>:8405/metrics |
HAProxy (node1) | Promtail | <node1_ip>:9080 |
Web Server 1 (node2) | Web Server | <node2_ip>:80 |
Web Server 1 (node2) | Node Exporter | <node2_ip>:9100 |
Web Server 1 (node2) | Promtail | <node2_ip>:9080 |
Web Server 2 (node3) | Web Server | <node3_ip>:80 |
Web Server 2 (node3) | Node Exporter | <node3_ip>:9100 |
Web Server 2 (node3) | Promtail | <node3_ip>:9080 |
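If you prefer verifying from the terminal rather than a browser, a short curl loop against the HTTP endpoints gives a quick pass/fail overview. The addresses below are placeholders for your own IPs:

```bash
# Quick reachability check for the main HTTP endpoints (substitute your IPs)
CONTROL=192.168.1.10; NODE1=192.168.1.11; NODE2=192.168.1.12; NODE3=192.168.1.13
for url in "$CONTROL:3000" "$CONTROL:9090" "$CONTROL:9093" \
           "$NODE1:80" "$NODE1:9100" "$NODE1:8405/metrics" "$NODE1:9080" \
           "$NODE2:80" "$NODE2:9100" "$NODE2:9080" \
           "$NODE3:80" "$NODE3:9100" "$NODE3:9080"; do
  printf '%-32s %s\n' "$url" "$(curl -s -o /dev/null -w '%{http_code}' "http://$url")"
done
```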
When encountering issues during the Ansible playbook execution, check the Ansible logs for detailed error messages. Ensure all prerequisites are correctly installed and configured before starting the installation. For issues related to specific components, refer to the component's documentation or the troubleshooting section of this guide.
This section showcases key milestones and achievements during the build and testing phases of the HA-WEBTRACK project. The following screenshots illustrate the successful deployment, configuration, and operation of the high-availability web server environment.
- Utilizing a Jinja2 template and Ansible facts to automate hosts file configuration for smooth and consistent identification of our nodes (a simplified template is sketched below)
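The real template lives in the project's roles; a simplified version of a hosts template driven by Ansible facts might look like this (the facts used are standard, but the structure is illustrative only):

```jinja
# templates/hosts.j2 -- illustrative only
127.0.0.1   localhost localhost.localdomain
{% for host in groups['all'] %}
{{ hostvars[host]['ansible_default_ipv4']['address'] }}  {{ hostvars[host]['ansible_fqdn'] }}  {{ hostvars[host]['ansible_hostname'] }}
{% endfor %}
```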
- Successfully configured ftp server to host our repository
- Using the Ansible `debug` module to create a hashed password for the `ansible` user (an illustrative task is shown below)
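As an illustration (not necessarily the exact task used in the repo), hashing a password with Ansible's `password_hash` filter and printing it via `debug` can be done like this:

```yaml
# Illustrative playbook: print a SHA-512 hash suitable for the user module's password argument
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Generate hashed password for the ansible user
      ansible.builtin.debug:
        msg: "{{ 'password' | password_hash('sha512') }}"
```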
- Created a landing page with server info for the web servers using Jinja2 templating (a minimal example is sketched below)
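A minimal sketch of such a landing-page template, built from standard Ansible facts (the project's actual page may show different details):

```jinja
<!-- templates/index.html.j2 -- illustrative only -->
<html>
  <body>
    <h1>Served by {{ ansible_hostname }}</h1>
    <p>FQDN: {{ ansible_fqdn }}</p>
    <p>IP address: {{ ansible_default_ipv4.address }}</p>
  </body>
</html>
```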
- Confirming the httpd service is up and running on port 80 on `node2` and `node3`
- Confirming the haproxy service is up and running on port 80 on `node1`
- HAProxy metrics are showing on port 8405; the `node2` and `node3` web servers are up
- Navigating to `node1` and refreshing, we can see both web servers coming up, confirming that HTTP requests are load balanced (a quick curl check is sketched below)
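You can see the same round-robin behavior from the terminal. Assuming the landing page embeds the serving host's name (as in the template sketch above), repeated requests through HAProxy should alternate between the two backends:

```bash
# Each request goes through HAProxy on node1; the serving backend should alternate
for i in $(seq 4); do curl -s http://<node1_ip>/ | grep -i 'served by'; done
```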
- Confirming the Grafana service is up and running on port 3000 on the control node
- Navigating to the control node's IP address on port 3000, we can access Grafana using username: admin, password: admin
- Confirming Node Exporter is up and running on port 9100 on `node1`, `node2`, and `node3`
- Navigating to `node2`'s metrics, we can see Node Exporter successfully pulled data from the system
- Confirming Prometheus is up and running on port 9090 on the control node
- Navigating to the control node on port 9090, we can confirm all our nodes are up
- After adding the Prometheus data source to Grafana, we can import prebuilt dashboard `159` for quick visualization
- A short and simple command to spin up our nodes' CPU (one possible approach is sketched below)
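The exact command used isn't shown here; one common, dependency-free way to generate CPU load on a node for about a minute is:

```bash
# Busy-loop every CPU core for ~60 seconds, then let the jobs exit on their own
for i in $(seq "$(nproc)"); do timeout 60 yes > /dev/null & done; wait
```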
- Our dashboard shows our nodes' uptime as well as available memory. We can also see the spike in load average and CPU usage from the previous command
- Confirming Promtail is up and running on port 9080 on `node1`, `node2`, and `node3`
- We can see the different logs being pulled when navigating to promtail's port
- Confirming Loki is up and running on port 3100 on the control node
- After adding the Loki data source in Grafana, logs are successfully populated for analysis
- Alertmanager started and running on port 9093
- Navigating to the control node on port 9093, we can confirm Alertmanager is up
- When navigating to the Alerts section of Prometheus on the control node, we can see our alerts are up and none are active
- In the Rules section of Prometheus, we have the PromQL queries used to define our alert rules (an illustrative rule is sketched below)
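For context, a Prometheus alerting rule pairs a PromQL expression with a duration, labels, and annotations. The project's real rules live in its roles; the sketch below shows the general shape using an instance-down style rule, with illustrative names and thresholds rather than the repo's exact values:

```yaml
# Illustrative alerting rule -- the repo's actual thresholds and labels may differ
groups:
  - name: example-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
```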
- First, let's confirm SELinux is set to Enforcing on all nodes (a quick check is shown below)
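One quick way to do this from the control node is an ad-hoc command across the inventory:

```bash
# Report the SELinux mode on every node
ansible all -i inventory -m command -a "getenforce"
```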
- Here, we'll install `httpd-tools`, which provides `ab` (Apache Benchmark) for generating HTTP requests against our servers for testing (install command below)
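On RHEL this is a single package install, for example:

```bash
# ab ships with the httpd-tools package on RHEL
sudo dnf install -y httpd-tools
ab -V   # confirm the binary is available
```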
- Next, we increase the maximum number of tracked connections (`nf_conntrack_max`) to handle high traffic (example below)
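As an example of how this can be set (the value here is a placeholder rather than the project's exact figure):

```bash
# Raise the connection-tracking table size; add to /etc/sysctl.d/ to persist across reboots
sudo sysctl -w net.netfilter.nf_conntrack_max=262144
```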
- A custom dashboard was created for this project and baseline metrics can now be taken and recorded
HAProxy and Web Server Metrics Summary
- HAProxy Disk Usage: 16.0%
- HAProxy Memory Usage: 12.5%
- Server Uptime: All servers (HAProxy, node2, node3) up for 1.43 hours
- Alerts: No alerts triggered
- Active Backend Servers: 2 (node2, node3)
CPU Usage:
- HAProxy:
  - Min: 0.475%, Max: 1.17%, Mean: 0.782%
  - Current: about 1%
- Web Servers (node2 and node3):
  - Node2: Min: 1.48%, Max: 2.99%, Mean: 2.38%
  - Node3: Min: 1.69%, Max: 3.32%, Mean: 2.55%
  - Current: about 3%
Load Average:
- HAProxy:
- 1-minute: 0.2, 5-minute: 0.06, 15-minute: 0.02
- Web Servers:
- Node2: 1-minute: 0.22
- Node3: 1-minute: 0.29
Memory Usage for Web Servers:
- Node2: Min: 49.5%, Max: 49.5%, Mean: 49.5%
- Node3: Min: 47.7%, Max: 47.7%, Mean: 47.7%
- Current memory usage: about 50% (stable)
HTTP Request Rate for HAProxy:
- Current request rate: 0
Session Rate for Web Servers:
- Current session rate: 0
HAProxy Logs:
- No data available
Overall Insights:
- Low Traffic: Minimal network traffic and no HTTP requests, suggesting light usage
- Stable Performance: CPU and memory usage on both HAProxy and web servers are low and stable
- No Active Sessions: Both web servers show no active sessions, indicating no load on the system at the moment
- ApacheBench used to generate a moderate load of 10,000 requests with 5 concurrent users (example invocation below)
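Assuming the benchmark is aimed at the HAProxy frontend (the address is a placeholder for node1), the invocation looks something like:

```bash
# Moderate load: 10,000 requests, 5 concurrent connections, through the load balancer
ab -n 10000 -c 5 http://<node1_ip>/
```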
- Now let's take a look at the results
HAProxy and Web Server Metrics Summary under Moderate Load
- HAProxy Disk Usage: 16.0% (unchanged)
- HAProxy Memory Usage: 12.9% (slight increase from baseline)
- Server Uptime: All servers (HAProxy, node2, node3) up for 1.49 hours
- Alerts: No alerts triggered
- Active Backend Servers: 2 (node2, node3)
CPU Usage:
- HAProxy:
  - Min: 0.441%, Max: 52.9%, Mean: 2.34%
  - Peak CPU usage reached 52.9% during load testing.
- Web Servers (node2 and node3):
  - Node2: Min: 1.48%, Max: 59.6%, Mean: 3.69%
  - Node3: Min: 1.69%, Max: 89.7%, Mean: 4.63%
  - CPU usage spiked on both web servers, with node3 seeing a significant increase.
Load Average:
- HAProxy:
- 1-minute: 0.7, 5-minute: 0.18, 15-minute: 0.05
- Web Servers:
- Node2: 1-minute: 0.24
- Node3: 1-minute: 2.26 (significant increase due to high traffic)
Memory Usage for Web Servers:
- Node2: Min: 49.5%, Max: 50.0%, Mean: 49.6%
- Node3: Min: 47.7%, Max: 48.4%, Mean: 47.8%
- Memory usage remained stable on both web servers during the load test.
HTTP Request Rate for HAProxy:
- Peak request rate: 400 requests per second.
Session Rate for Web Servers:
- Session rate peaked during the load test, with node2 and node3 handling requests equally, and returned to normal after the test
HAProxy Logs:
- Log entries reflect successful requests and traffic management by HAProxy during the test.
Insights:
- Increased Traffic: The system handled a moderate load of 10,000 requests with a significant spike in both CPU usage and network traffic on HAProxy and web servers.
- Stable Memory Usage: Despite the load, memory usage remained stable on both web servers.
- Peak Performance: Node3 experienced higher CPU load than Node2 even though requests were distributed evenly.
- Evenly Distributed Requests: The node2 and node3 web servers handled an equal number of requests, confirming HAProxy load balancing is effective.
- No Alerts: Despite the increased load, no alerts were triggered, indicating that the system is well-configured to handle moderate traffic without failures.
- Now we'll increase the load to 70,000 requests and 20 concurrent connections for this test (example invocation below).
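Again targeting the HAProxy frontend (placeholder address), the heavier run looks like:

```bash
# High load: 70,000 requests, 20 concurrent connections
ab -n 70000 -c 20 http://<node1_ip>/
```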
- After a few minutes, a high request rate alert was triggered and we received a notification in the Slack channel
- Navigating to the Prometheus alerts page, we can verify the alert is in a firing, active state and review the alert rule
- Below are the screenshots of our custom grafana dashboard with metrics to observe
HAProxy and Web Server Metrics Summary under High Load
- HAProxy Disk Usage: 16.1% (slight increase)
- HAProxy Memory Usage: 12.6% (minimal change)
- Server Uptime: All servers (HAProxy, node2, node3) up for 1.61 hours.
- Alerts: 1 active alert (High Request Rate)
- Active Backend Servers: 2 (node2, node3)
CPU Usage:
- HAProxy:
  - Min: 0.441%, Max: 71.1%, Mean: 13.1%
  - Peak CPU usage spiked to 71.1% during high load.
- Web Servers (node2 and node3):
  - Node2: Min: 1.48%, Max: 72.1%, Mean: 13.9%
  - Node3: Min: 1.69%, Max: 98.3%, Mean: 19.3%
  - Node3 hit 98.3% CPU usage, indicating that it was handling a significant portion of the load.
Load Average:
- HAProxy:
  - 1-minute: 1.86, 5-minute: 0.77, 15-minute: 0.38
  - Significantly higher compared to the baseline and moderate load tests.
- Web Servers:
  - Node2: 1-minute: 2.2
  - Node3: 1-minute: 43.7 (major spike due to heavy traffic)
Memory Usage - Web Servers:
- Node2: Min: 49.5%, Max: 50.2%, Mean: 49.8%
- Node3: Min: 47.7%, Max: 59.0%, Mean: 51.3%
- Memory usage increased slightly on both web servers, with Node3 peaking at 59.0%.
HTTP Request Rate - HAProxy:
- Peak request rate: 400 requests per second.
Session Rate - Web Servers:
- Session rate is still balanced 50/50 between the two web server nodes
HAProxy Logs:
- The logs indicate successful handling of the increased load, showing HTTP requests and load balancing between the two web servers.
Insights:
- Increased Load: The system handled a much heavier load of 70,000 requests with significant spikes in CPU usage and network traffic.
- High CPU Usage: Both HAProxy and Node3 experienced high CPU usage, indicating that the system was operating near its maximum capacity.
- Triggered Alerts: The High Request Rate alert was successfully triggered, and notifications were sent to the Slack channel, demonstrating effective monitoring and alerting.
- Stable Memory: Despite the high load, memory usage remained stable on both web servers.
- For this scenario, we'll power off node3 to simulate a downed instance, run a moderate load test, and observe the metrics
- After a few minutes, an instance down alert is triggered. We can see the firing state in the alerts page of our prometheus service
- Consequently, a notification is received via our slack channel
- Below are the screenshots of our custom grafana dashboard with metrics to observe
HAProxy and Web Server Metrics Summary during Failover
- Active Backend Servers: 1 (node2)
CPU Usage:
- HAProxy:
  - Min: 0.441%, Max: 80.7%, Mean: 13.6%
  - A significant spike occurred when the failover happened, with CPU usage peaking at 80.7%.
- Web Server (node2):
  - Min: 1.55%, Max: 76.4%, Mean: 14.6%
- Web Server (node3):
  - No CPU usage after being powered off, as expected.
Load Average:
- HAProxy:
  - 1-minute: 1.86, 5-minute: 0.770, 15-minute: 0.354
- Web Servers:
  - Node2: 1-minute load peaked at 2.20.
  - Node3: Load showed 0 after being powered off, indicating no activity.
Memory Usage:
- Node2: Min: 49.7%, Max: 50.8%, Mean: 50.2%
- Node3: Min: 48.4%, Max: 60.2%, Mean: 57.2%
- Node2 memory usage remained stable as it took over the traffic after the failover.
- Node3 shows a break in the graph confirming the downed instance
HTTP Request Rate (HAProxy):
- Peaked at 400 requests/sec during the failover.
- After node3 was powered off, the request rate dropped temporarily but resumed at node2.
Session Rate (Web Servers):
- Node2 handled the traffic after failover with 188 sessions after the switch.
- Node3 showed 0 sessions after being powered off, confirming the failover to node2.
HAProxy Logs:
- Logs show the switch of traffic from node3 to node2, confirming successful failover behavior, with each HTTP request being rerouted.
Overall Insights:
- Successful Failover: After powering off node3, HAProxy successfully rerouted the traffic to node2.
- Spikes in CPU and Load: Both HAProxy and node2 experienced spikes in CPU usage and load during the transition, but performance remained stable.
- No Downtime: Traffic handling switched seamlessly, indicating a properly configured failover setup.
- For this scenario, we'll power node3 back on while running another load test and observe the metrics. We see the instance down alert has been resolved, but two more alerts have been triggered due to high CPU usage and high request rates
- The screenshots below show the recovery metrics of our test
- After a successful recovery, we can see no alerts are active and in our slack channel, the previous alerts have been resolved
Failover Recovery Test Summary
Process:
- Node3 was powered off, and traffic continued to flow through node2 without interruption.
- After powering node3 back on, it took a short time before it was active again and load started to balance between node2 and node3.
Alerts:
- High CPU Usage Alert fired when node3 was restored due to high traffic and processing loads before the instance fully recovered.
- High Request Rate Alert also fired during the recovery process.
Metrics During Recovery:
- HAProxy CPU Usage reached a peak of 86.2% but averaged 17.2% by the end of the recovery.
- Web Servers CPU Usage peaked at:
- Node2: Max 95.1% during the failover, recovering to a mean of 16.2%.
- Node3: Max 98.2% after recovery, with a mean of 42.4%.
Session Rates for Web Servers:
- Traffic initially routed entirely to node2 but was distributed between node2 and node3 after node3 was back online.
Resolution:
- Both High CPU Usage and High Request Rate alerts were resolved once node3 fully recovered and the load balanced across both web servers.
System Status Post-Recovery:
- Alerts: No active alerts.
- CPU and Memory Usage: Back to normal levels, indicating stable system performance.
- Load Distribution: Load between node2 and node3 was balanced, with session rates distributed evenly.
The recovery was successful, with the system returning to its expected state, and all metrics stabilizing after the failover and restoration of node3.
This project was a blast from start to finish! Starting with the Ansible build and moving through to the final testing phases, everything came together. The initial playbooks were carefully crafted to automate the setup of high-availability web servers, and throughout the process, best practices were drawn from RHCSA and RHCE principles. Diving into PromQL to create a custom dashboard specifically for this project was very insightful. Each playbook was converted into roles, making the project super modular and reusable for anyone who wants to duplicate or expand on it. The dynamic variables were the cherry on top, ensuring the entire project is flexible and easy to adapt.
The hands-on testing, from failover scenarios to high-load stress tests, was a real highlight! Grafana, Prometheus, and Loki provided real-time insights, while Alertmanager and Slack kept us informed of any potential issues. It all came full circle with the recovery and stabilization phases. This was more than just building infrastructure; it was a real-world example of how automation, monitoring, and security can work together seamlessly. I'm stoked to have brought it all to life!