Software Stack
Each year, we publish an activity report that presents a summary of our day-to-day activities and of our projects. Every year, it follows roughly the same structure, and presents the same tables and graphs, updated. It is written in collaboration by all the CISM members.
Given the recurring structure and content, and the fact that the report is a shared effort, we decided to write it in LaTeX, which gives us a lot of flexibility to produce a good-looking document with a custom style file. Because LaTeX is purely text-based, the report has two important properties:
- it can be versioned under a version control system; and
- it can be (partly) created by automated tools.
LaTeX is a very good system for this purpose, but its syntax is sometimes intrusive and unintuitive. Therefore, we keep the structure of the document in LaTeX, but the individual paragraphs are written in Markdown and converted to LaTeX using Pandoc and a simple Makefile. The whole thing is versioned with Mercurial.
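The conversion step can be sketched as follows. This is only an illustration in Python, not the actual Makefile: the directory layout and the function names are assumptions made for the example. It builds one pandoc call per Markdown file and, like make, only rebuilds a target that is missing or older than its source.

```python
import subprocess
from pathlib import Path


def pandoc_command(source: Path, build_dir: Path) -> list[str]:
    """Return the pandoc call converting one Markdown file to a LaTeX fragment."""
    target = build_dir / source.with_suffix(".tex").name
    return ["pandoc", "--from", "markdown", "--to", "latex",
            "--output", str(target), str(source)]


def is_stale(source: Path, target: Path) -> bool:
    """Like make: rebuild if the target is missing or older than its source."""
    return not target.exists() or target.stat().st_mtime < source.stat().st_mtime


def convert_all(source_dir: Path, build_dir: Path) -> list[Path]:
    """Convert every stale Markdown file; return the list of rebuilt targets."""
    build_dir.mkdir(parents=True, exist_ok=True)
    rebuilt = []
    for source in sorted(source_dir.glob("*.md")):
        command = pandoc_command(source, build_dir)
        target = Path(command[command.index("--output") + 1])
        if is_stale(source, target):
            subprocess.run(command, check=True)  # requires pandoc on the PATH
            rebuilt.append(target)
    return rebuilt
```

Since pandoc is called without its standalone option, each output file is a fragment that the main LaTeX document can pull in with \input.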
We have developed scripts that fetch data from our databases (e.g. GLPI for the inventory, Zabbix for the load of the clusters, or Slurm for the cluster usage) and write the LaTeX code that produces the tables or graphs (with TikZ). This avoids a lot of manual copy/pasting, which saves time and eliminates a source of errors. The list of publications is integrated directly from Dial.pr thanks to an extraction tool written by Étienne Huens (SST/INMA).
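To give an idea of what such a script does once the data has been fetched, here is a minimal Python sketch that turns rows of values into a LaTeX tabular. It is not one of the actual scripts: the function names are invented for the example, the database query is left out, and the row used in the usage note is a placeholder rather than a real figure.

```python
def latex_escape(text: str) -> str:
    """Escape the characters that LaTeX would otherwise interpret."""
    replacements = {
        "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "_": r"\_",
        "#": r"\#", "$": r"\$", "{": r"\{", "}": r"\}",
    }
    # Character-by-character replacement avoids escaping a character twice.
    return "".join(replacements.get(char, char) for char in text)


def latex_table(headers: list, rows: list) -> str:
    """Render headers and rows as a tabular: first column left, others right."""
    column_spec = "l" + "r" * (len(headers) - 1)
    lines = [rf"\begin{{tabular}}{{{column_spec}}}", r"\hline"]
    lines.append(" & ".join(latex_escape(str(h)) for h in headers) + r" \\")
    lines.append(r"\hline")
    for row in rows:
        lines.append(" & ".join(latex_escape(str(cell)) for cell in row) + r" \\")
    lines += [r"\hline", r"\end{tabular}"]
    return "\n".join(lines)
```

Calling latex_table(["Cluster", "Jobs"], [["cluster_a", 10]]) returns a string that can be written to a .tex file and included in the report. Escaping matters here because names coming from a database often contain underscores.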
Using text files rather than a Word/OpenOffice document also allows automatic processing of the text. Prior to compilation, we run several scripts, for instance to ensure consistent capitalisation or to add links when named entities (e.g. "SST/CISM") are found.
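The kind of processing involved can be sketched in a few lines of Python. The sketch below is an assumption about how such scripts might look, not the scripts themselves: the word list, the placeholder URL and the choice of emitting Markdown links are all made up for the example.

```python
import re

# Hypothetical canonical spellings; the keys are lower-case.
CANONICAL_SPELLINGS = {"latex": "LaTeX", "slurm": "Slurm", "zabbix": "Zabbix"}

# Hypothetical named entities and placeholder link targets.
ENTITY_LINKS = {"SST/CISM": "https://example.org/cism"}

_SPELLING_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in CANONICAL_SPELLINGS) + r")\b",
    re.IGNORECASE,
)


def normalise_capitalisation(text: str) -> str:
    """Rewrite every known word with its canonical capitalisation."""
    return _SPELLING_PATTERN.sub(
        lambda match: CANONICAL_SPELLINGS[match.group(0).lower()], text
    )


def link_entities(text: str) -> str:
    """Wrap each known entity in a Markdown link, unless it is already linked."""
    for entity, url in ENTITY_LINKS.items():
        # The lookbehind skips entities already preceded by "[", so running
        # the function twice does not nest links.
        pattern = r"(?<!\[)" + re.escape(entity)
        text = re.sub(pattern, lambda _m, e=entity, u=url: f"[{e}]({u})", text)
    return text
```

Both functions take and return plain strings, so they can be chained over every Markdown file before the conversion to LaTeX.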
This approach saves us several person-hours each year, thanks to the features available in Mercurial and to the custom scripts we have developed. Of course, when our infrastructure changes, we have to adapt our scripts. Fortunately, our infrastructure rarely changes; but it can happen when, for instance, we move all our computers into a brand new computer room.
On our multi-purpose compute cluster Manneback, we use a trio of open-source tools to set up our compute nodes, namely Cobbler, Ansible and Salt.
When we acquire new hardware, the workflow proceeds as follows once the hardware is in place:
1. The MAC addresses of the nodes are collected, and the names and IP addresses are chosen. To collect the MAC addresses, we often simply look into the logs of the DHCP server once we have booted the nodes.
2. The above information (MAC, IP and hostname) is entered in the Cobbler system and a kickstart is created to handle the network configuration and the disk partitioning. Some necessary packages are also installed at that time, the most important being the salt-minion. The Salt minion is also configured through the kickstart.
At this point, the node is ready to be deployed.
3. The node is then restarted, and its operating system is installed through Cobbler. The kickstart runs, then the Salt minion starts and registers itself with the Salt master.
At this point, the node is ready to be integrated.
4. Then, an Ansible playbook is run that takes care of creating the configuration files and registering the nodes with the inventory and the monitoring system. Ansible gets its inventory from the Salt server and only operates on nodes whose keys are not yet added to the Salt master. More precisely, the playbook gathers the public SSH key to create a proper hosts.allow for host-based SSH authentication, gathers node information to build the Slurm configuration file, and registers the node with the OCS inventory and with Zabbix.
At this point, the node is ready to be configured.
5. Finally, the Salt master issues a "state.highstate" to propagate the configuration files, install the necessary packages and mount the network filesystems. The nodes are then ready for use.
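Step 4 says that Ansible gets its inventory from the Salt server and only touches nodes whose keys are not yet added. One way this could work is a small dynamic inventory script, sketched below in Python. This is an assumption, not the mechanism actually used: it supposes the key listing is obtained as JSON (for instance from "salt-key -L --out=json", where pending keys appear under "minions_pre"), and the group name "new_nodes" is invented for the example. The output follows the JSON layout Ansible expects from an inventory script (groups with a "hosts" list, plus "_meta").

```python
import json


def pending_minions(salt_keys: dict) -> list:
    """Return the minions whose keys are still waiting to be accepted."""
    return sorted(salt_keys.get("minions_pre", []))


def ansible_inventory(salt_keys: dict, group: str = "new_nodes") -> dict:
    """Build an Ansible dynamic-inventory structure from a Salt key listing."""
    hosts = pending_minions(salt_keys)
    return {
        group: {"hosts": hosts},
        # An empty hostvars entry per host spares Ansible a --host call each.
        "_meta": {"hostvars": {host: {} for host in hosts}},
    }


def render_inventory(salt_keys_json: str) -> str:
    """Turn the JSON text of a key listing into inventory JSON for Ansible."""
    return json.dumps(ansible_inventory(json.loads(salt_keys_json)), indent=2)
```

With such a script, a playbook targeting the hypothetical "new_nodes" group would only run on freshly installed nodes, and nodes already in production would be left alone.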
It may be a bit surprising at first sight to use both Ansible and Salt; most system administrators would consider using one or the other. But we found it more efficient to combine their respective strengths.
The reasons why we would not work with Salt only:
- Ansible has many more modules to interact with other systems and to manipulate text files;
- Ansible allows easier control of operations (orchestration) when synchronisation between services is needed;
- Ansible is better suited to handling "one-shot" operations such as registering with the monitoring system.
Conversely, we found the following reasons why we would not work with Ansible only:
- Salt's main mode of operation (pull) does a better job at handling nodes that are down;
- Salt scales better when several hundred nodes need to be configured at the same time;
- Salt makes it easier to write declarative configuration code based on rules and dependencies.
In the end, we have a combination of three free, open-source tools that work hand in hand and allow us to bring new nodes to life in minutes in our compute cluster.
We moved into a brand new data center in April 2016, and the first machines we installed there after the switches were the four computers dedicated to our private OpenStack cloud. We needed to deploy those computers at a time when the top-of-rack switches were not yet connected to their uplink switches. But the WiFi from the building next door was within reach.
So we bought a Raspberry Pi 3, installed Fedberry, configured it as a gateway router, and set up a Cobbler server on it. Once the four nodes were properly patched and connected to the various internal networks, we deployed them. We chose CentOS 7 for the operating system, and used the Kickstart features to set up the network interfaces, configure IPMI, install useful packages, set the root password and SSH keys, etc. Only the first two nodes, which are management nodes, have direct connectivity to the external networks. They provide masquerade/NAT for the other two nodes, which host the virtual machines and virtual networks.
Once they were all up and running, we cloned onto the first computer the Mercurial repository that holds our Ansible playbook and roles to deploy OpenStack.
One role installs Gluster on all four nodes (distributed/replicated). The Gluster filesystem holds the virtual machines (instances), the virtual disks (block volumes) and the OS images. Another role sets up Galera with MariaDB, also on all four nodes, and a third one installs a RabbitMQ cluster. These lay the foundation of the OpenStack install, upon which all OpenStack services rely.
The two management nodes are redundant and share a virtual IP, managed with Keepalived. They both host an HAProxy service for all the OpenStack modules. Both HAProxy and Keepalived are installed by their respective roles.
Then come the roles for the key OpenStack elements: Keystone (the identity module), Glance (the image manager), Nova (the virtual machine handler) and Neutron (the virtual network handler). After that, other roles take care of Cinder, for virtual block devices, and Horizon, OpenStack's web interface. All OpenStack modules are installed in active-active redundancy mode, except for Neutron, which is active-passive.
Finally, roles for the ELK stack (Elasticsearch, Logstash, Kibana) are run to collect all OpenStack log files. Zabbix finalizes the installation. Note that as Zabbix is also redundant, we get all alerts twice... But that is better than receiving none.
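The order in which the roles are applied, as described above, can be summarised with a small Python sketch. The stage labels and role names are invented for the example and do not come from the actual playbook. Each role is simply assumed to depend on every role of the previous stage, and the standard library's graphlib (Python 3.9 or later) derives a valid run order.

```python
from graphlib import TopologicalSorter

# Stages in the order given in the text; role names are hypothetical.
STAGES = [
    ("foundation", ["gluster", "galera", "rabbitmq"]),
    ("high availability", ["keepalived", "haproxy"]),
    ("core services", ["keystone", "glance", "nova", "neutron"]),
    ("additional services", ["cinder", "horizon"]),
    ("logging and monitoring", ["elasticsearch", "logstash", "kibana", "zabbix"]),
]


def dependency_graph(stages: list) -> dict:
    """Map each role to the set of roles that must be applied before it."""
    graph = {}
    previous_roles = []
    for _label, roles in stages:
        for role in roles:
            graph[role] = set(previous_roles)
        previous_roles = roles
    return graph


def run_order(stages: list) -> list:
    """Return one ordering of the roles that respects every dependency."""
    return list(TopologicalSorter(dependency_graph(stages)).static_order())
```

In an Ansible playbook the same effect is obtained by listing the roles in sequence. The graph form is mostly useful to check that a reordering still respects the foundation-first constraint.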
The Ansible playbooks and roles were developed in a virtual environment with VirtualBox and Vagrant. Every aspect was tested beforehand, except for some of the Neutron configuration, because we could not exactly reproduce the physical environment in VirtualBox. But it was sufficient to be confident that our playbooks would do the job. And indeed, the whole setup was up in a couple of hours.
Once the OpenStack cloud was operational, a set of shell scripts and Ansible playbooks provisioned all our virtual machines. The first one we took care of was our Salt server, which is subsequently used to configure all the other VMs: DNS server, LDAP servers, Zabbix server, OCS server, etc.
The CISM software stack
At CISM, we manage a few hundred computers, so we need a comprehensive set of tools to handle the burden of deploying, configuring, monitoring and repairing all those computers. For reference, here is the software stack we use at CISM, from Ansible to Zabbix.
Elasticsearch. We use an Elasticsearch cluster to store the logs of our compute nodes and service nodes. Based on Apache's Lucene search engine, it is unbeatable at indexing text documents.
Logstash. Logstash is used to populate the Elasticsearch cluster with the logs produced by all the services we run. With specially crafted regexes, it parses the logs before pushing them to the Elasticsearch server.
Kibana. We use Kibana to browse and search the log database. Its filtering and plotting facilities make it very easy to navigate the logs and correlate events for troubleshooting.
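To illustrate what parsing logs with regexes means, here is a minimal Python sketch that splits a traditional syslog line into named fields. Logstash does this with its own pattern syntax (grok), so this is an illustration of the idea rather than a Logstash configuration, and the pattern and field names are assumptions made for the example.

```python
import re
from typing import Optional

# Matches lines such as "Apr 12 09:15:02 node001 sshd[2211]: some message".
SYSLOG_PATTERN = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<program>[^\[:\s]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<message>.*)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Return the fields of a syslog line, or None if it does not match."""
    match = SYSLOG_PATTERN.match(line)
    if match is None:
        return None
    return match.groupdict()
```

Once a line has been split into fields such as host and program, it can be stored as a structured document. This is what makes filtering by node or by service straightforward afterwards.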
- Lustre is used on Lemaitre2 for the fast, InfiniBand-connected scratch space common to all nodes. Lustre is probably the market leader for fast parallel filesystems and offers very good performance.
- GPFS. IBM's GPFS (now called Spectrum Scale) is a very versatile, feature-loaded distributed and parallel filesystem that we use at the CÉCI level for the central storage shared by all compute clusters.
- GlusterFS is used for home directories, where data safety is more important than performance. The fact that it is 'metadata-less' and that the files can be accessed directly even if the Gluster filesystem is down is very reassuring.
- ZFS. For long-term mass storage, we favour ZFS to store large amounts of data safely. We like the fact that everything is stored on the disks themselves. Several times, we have migrated 30-disk ZFS filesystems from one OS (Solaris) to another (GNU/Linux), or from one hardware controller to another.
- FHGFS. The Fraunhofer filesystem is used on our clusters for the scratch space, where performance is more important than data safety. It is easy to set up (compared with Lustre, for instance).
- Ceph is an object store that can offer remote block devices and a parallel filesystem. Ceph is currently being studied as a means to converge all non-performance storage types into a single architecture based on commodity/old hardware. Example uses include CephFS instead of ZFS for mass storage, Ceph block devices for virtual machines instead of Gluster, and high-availability block storage for management services (e.g. Slurm) rather than DRBD. In the future, it will also serve as an object store for users who request it.