ISG News has posted a new item, 'The Art of Scaling'
Note: this is a purely anecdotal posting about our struggles with some
performance bottlenecks over the last few months. If you're not interested in
this kind of background information, feel free to skip it.
You might have noticed that since about January 2012 using our file and mail
servers hasn't been as smooth as usual. This posting will give you some
background information concerning the challenges we encountered and why it took
so long to fix them. Let's begin with the file server.
Way back in the day (i.e. five years ago), when the total file server data volume
at D-PHYS was about 10 TB, we used individual file servers to store this data.
When one server was full, we got a bigger one, copied all the data and life was
good for another year or two. Today, the file server data volume (home and group
shares) is above 150 TB and growing fast, and this strategy no longer works:
individual servers don't scale, and copying this amount of data alone
takes weeks. That's why in 2009 we started migrating the 'many individual
servers' setup to a SAN architecture in which the file servers are just huge
hard drives (iSCSI over Infiniband, for the technically inclined) connected to a
frontend server that manages space allocation and the file system. The same is
true for the backup infrastructure, where the data volume is even bigger.
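To give a rough idea of what 'a frontend server that manages space allocation' means, here is a toy sketch in Python (emphatically not the actual software we run; the server names and sizes are made up): the frontend sees each backend server as one huge block device and concatenates them into a single logical address space, much like a volume manager does, so growing the pool means attaching another backend rather than copying everything to a bigger machine.

    # Toy illustration of a SAN-style frontend: several backend block devices
    # (think iSCSI targets, one per storage server) are concatenated into one
    # logical address space, the way a volume manager such as LVM would do it.
    # Names and sizes below are invented for the example.

    class Backend:
        """One storage server exported as a single large block device."""
        def __init__(self, name, size_blocks):
            self.name = name
            self.size_blocks = size_blocks

    class Frontend:
        """Maps a linear logical block number to (backend, local block)."""
        def __init__(self, backends):
            self.backends = backends

        @property
        def total_blocks(self):
            return sum(b.size_blocks for b in self.backends)

        def locate(self, logical_block):
            if not 0 <= logical_block < self.total_blocks:
                raise ValueError("logical block out of range")
            offset = logical_block
            for backend in self.backends:
                if offset < backend.size_blocks:
                    return backend.name, offset
                offset -= backend.size_blocks
            raise AssertionError("unreachable")

    # Growing the pool is just appending another backend; no existing data
    # has to be copied, which is the whole point compared to 'one big server'.
    pool = Frontend([Backend("san01", 10_000), Backend("san02", 20_000)])
    print(pool.total_blocks)     # 30000
    print(pool.locate(15_000))   # ('san02', 5000)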
This new setup had to be developed, tested and put in place as seamlessly and
unobtrusively as possible while ensuring data access at all times (apart from
single hour-long migrations). The SAN architecture was implemented for Astro in
December 2010 and has been running beautifully ever since. In 2011 we laid the
groundwork to adopt this system for the rest of D-PHYS's home and group shares
and after a long and thorough testing period the rollout happened on January 5,
2012. Unfortunately, that's when things got ugly.
At first, we noticed some exotic file access problems on 32-bit workstations. It
took us some time to understand that the underlying issue was an incompatibility
with the new filesystem using 64-bit addresses for the data blocks. As a
consequence we had to replace the filesystem of the home shares. Independently
we ran into serious I/O issues with the installed operating system, so we had to
upgrade the kernel of the frontend server and move the home directories onto a
dedicated server. In parallel, we had to incorporate some huge chunks of group
data while always making sure that nightly backups were available. All this
necessitated a few more migrations until we finally achieved a stable system on
March 28.
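For the technically inclined: the typical manifestation of such a mismatch is that inode or block numbers above 2^32 can no longer be handed to 32-bit clients, which then fail with errors like "Value too large for defined data type". The following is a hypothetical diagnostic sketch along those lines (not the exact tool we used, and it assumes a Linux-style stat interface): it walks a share and reports files whose inode numbers no longer fit into 32 bits.

    # Hypothetical diagnostic sketch (not the tool we actually used): walk a
    # share and report files whose inode numbers no longer fit into 32 bits.
    # Legacy 32-bit clients without large-inode support typically fail on
    # such files with EOVERFLOW ("Value too large for defined data type").

    import os
    import sys

    LIMIT = 2**32  # smallest value that no longer fits into a 32-bit field

    def find_large_inodes(root):
        for dirpath, dirnames, filenames in os.walk(root):
            for name in dirnames + filenames:
                path = os.path.join(dirpath, name)
                try:
                    ino = os.lstat(path).st_ino
                except OSError:
                    continue  # vanished or unreadable; skip it
                if ino >= LIMIT:
                    yield path, ino

    if __name__ == "__main__":
        root = sys.argv[1] if len(sys.argv) > 1 else "."
        for path, ino in find_large_inodes(root):
            print(f"{ino}\t{path}")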
The upshot: what we had hoped to be a fast and easy migration turned out to
cause a lot of problems and take much longer than anticipated, but now we have a
stable and solid setup that will scale up to hundreds or even thousands of TB of
data.
See live volume management and usage graphs for our file servers.
As for the mail server, the issues were partly related to the above and partly
just coincided in time. The IMAP server does need access to the home directories
and hence also suffered when their performance was impaired. But even after
having solved the file server issues, we still saw isolated load peaks on the IMAP
server that prevented our users from working with their email. Again, we put a
lot of time and effort into finding the reason. As of April 13, we're back to
good performance and arrive at the following set of conclusions:
Particular issues:
- a hard disk in the mail server RAID that was failing without reporting errors
  seems to have impaired performance
- the CPU load of the individual virtual machines on the mail server was not
  distributed optimally across the available CPU cores (a generic pinning
  sketch follows after these lists)

General mail server load:
- while incoming mail volume isn't growing much, outgoing mail has grown by 50%
  in the last year alone
- increasingly sophisticated spam requires more thorough virus and spam
  scanning, adding to the load on the mail server
- our users have amassed 1.1 TB of mail storage (up from 400 GB in January
  2010), all of which needs to be accessed and organized
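To illustrate the CPU distribution point from the list above: on a Linux host, a virtual machine's emulator process can be pinned to a dedicated set of cores so that the guests stop competing for the same ones. The sketch below is generic, the process names and PIDs are invented, and it is not necessarily how our virtualization host is configured.

    # Generic Linux example of CPU pinning via the scheduler affinity API.
    # The PIDs and VM names are placeholders; on a real host you would look
    # up the emulator processes of your virtual machines first.

    import os

    def pin_process(pid, cores):
        """Restrict the given PID to the given set of CPU core numbers."""
        os.sched_setaffinity(pid, cores)  # Linux-only call
        print(f"PID {pid} now runs on cores {sorted(os.sched_getaffinity(pid))}")

    if __name__ == "__main__":
        # Hypothetical layout: spread three VM processes over distinct core pairs.
        vm_pids = {"mailvm1": 1234, "mailvm2": 1235, "mailvm3": 1236}
        core_sets = [{0, 1}, {2, 3}, {4, 5}]
        for (name, pid), cores in zip(vm_pids.items(), core_sets):
            try:
                pin_process(pid, cores)
            except OSError as exc:
                # The placeholder PIDs most likely don't exist on your machine.
                print(f"{name}: could not pin PID {pid}: {exc}")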
Bottom line:
We'd like to thank you for your patience during the last 4 months and apologize
for any inconvenience you might have had to endure. In all likelihood the
systems will be a lot more stable in the future, but of course we're constantly
working to ensure the D-PHYS IT infrastructure is able to keep up with the
fast-growing demand for disk space (the data volume has tripled in the last year
alone). We've learned a lot and we'll put it to good use.
You may view the latest post at
https://nic.phys.ethz.ch/news/2012/04/19/the-art-of-scaling/
Best regards,
Christian Herzog
daduke(a)phys.ethz.ch
ISG News has posted a new item, 'Mail Server Maintenance Downtime this Evening'
For hardware and other maintenance work, we have scheduled a downtime of our
mail server this evening (Friday, 13 April 2012) after 6 pm.
The downtime will likely last less than one hour. During that time you will
neither be able to access your mail on the server nor send mail via our server.
Mail sent to the Department of Physics won't get lost, but will arrive with
some delay.
You may view the latest post at
https://nic.phys.ethz.ch/news/2012/04/13/mail-server-maintenance-downtime/
Best regards,
Axel Beckert
beckert(a)phys.ethz.ch
ISG News has posted a new item, 'Temporary SMB access restriction'
Last night a security problem was detected in the SMB server software we use for
our group and home shares. In order to protect your data and our systems, we are
temporarily restricting access to our group and home shares to the ETHZ IP
address range until security updates are available. If you're outside the ETH
network and need to access your data, please use VPN. We expect the updates to
be released later today or tomorrow and will then restore worldwide access.
You may view the latest post at
https://nic.phys.ethz.ch/news/2012/04/11/temporary-smb-access-restriction/
Best regards,
Christian Herzog
daduke(a)phys.ethz.ch