The Storage Systems Administrator in the Advanced Computing Center for Research and Education (ACCRE) at Vanderbilt University serves as systems administrator for the ACCRE cluster team that helps manage the >10,000-core Linux cluster and the 3 PB (usable) cluster GPFS filesystem. This position will report to the Director of Research Computing Operations.
Computing is emerging as a third paradigm for discovery, complementing theory and experiment. To quote from a recent National Research Council report: "The exploding technology of computers and networks promises profound changes in the fabric of our world... As seekers of knowledge, researchers will be among those whose lives change the most... Researchers themselves will build this New World largely from the bottom up, by following their curiosity down the various paths of investigation that the new tools have opened. It is unexplored territory...." The Advanced Computing Center for Research and Education (ACCRE) is being built and operated by Vanderbilt faculty. Its mission is to allow Vanderbilt researchers to define, benefit from, and explore the "New World" described above. Towards this aim, the center has established the following goals:
Low Barriers: Provide computational services with low barriers to participation, working with researchers to develop and adapt HPC tools to their avenues of inquiry.
Expand the Paradigm: Work with members of the Vanderbilt community to find new and innovative ways to use computing in the humanities, arts, and education.
Promote Community: Foster an interacting community of researchers and develop a campus culture that promotes and supports the use of HPC tools.
The center manages a Linux cluster of more than 10,000 processors, comprising multiple computer architectures, and over 15 PB of disk storage.
Duties and Responsibilities
This position will be responsible for critical aspects of the following projects:
ACCRE Storage Systems Administration
Maintain, administer, and improve ACCRE's storage services.
Troubleshoot hardware and software problems related to the storage.
Be the primary support for remote user access to the storage systems.
Support tape archive and data backup.
Be a member of the team developing, deploying, and supporting a distributed NAS system.
Work on adapting existing software tools to support the transport and management of research data between various storage pools both on and off campus. This will require additional work packaging and adapting tools for the various research communities.
ACCRE Compute Cluster Administration
Set up/configure cluster hardware, including gateways, compute nodes, and cluster management infrastructure.
Install operating system and related utility software.
Monitor the status of the cluster utilizing tools such as Nagios, including customizing the tools for ACCRE-specific needs.
Compile/install application software packages needed by researchers.
Assist with the administration of the cluster job scheduler, including modifying user limits, creating/modifying/deleting node reservations, and diagnosing issues with the job scheduler.
ACCRE Project Objectives
Serve as a technical resource to users and other ACCRE staff members.
Plan work for other team members to meet project guidelines.
Train and supervise other team members, as needed, and act as internal technical consultant to ACCRE staff, particularly related to projects on which this position is serving as the lead systems administrator.
Respond to help desk tickets to solve user problems and to educate users on cluster usage.
Serve as the on-call person for evening and weekend hours on a rotating basis, for example on a 4-week schedule or every other week in a Level 2 support rotation.
Work nights and weekends as needed for scheduled or unscheduled downtimes.
Compile documentation in a timely manner for all ACCRE projects and tasks, both for new projects and for changes to ongoing projects.
Physically move and lift hardware when needed.
Actively identify and participate in training, education, and development activities to improve knowledge and performance and to sustain and enhance professional development.
Keep up-to-date on software systems, operation procedures, and technological developments in systems, high performance computing, and programming.
Research, design, and evaluate new technologies/concepts that could potentially improve ACCRE's capabilities and/or services.
Attend meetings, conferences, and seminars in systems, high performance computing, and programming, and in particular regularly attend and participate in OSG meetings. Give presentations on ACCRE services at conferences when requested.
Work with outside companies to improve ACCRE services. Develop partnerships with vendors and service providers. Work with both software and hardware developers to implement needed customizations specific to our site requirements.
Supervisory Relationships
This position does not have supervisory responsibility; this position reports administratively and functionally to the Director of Research Computing Operations.
A Bachelor's degree from an accredited institution of higher education is necessary.
A Bachelor's degree in computer science or computer engineering from an accredited institution of higher education is strongly preferred.
Five years of experience in system administration of UNIX-based operating systems is required (ten years preferred).
Five years of experience with clustered storage solutions (GPFS preferred).
Five years of experience with programming/scripting.
Self-driven, inquisitive, and productive troubleshooting abilities.
Strong ability to work individually and in a team environment.
Commitment to continuous improvement.
Ability to rapidly adapt to new technological dynamics.
Strong ability to share knowledge coherently with others.
Ability to work independently and make critical decisions.
Ability to gain knowledge/skills independently and from team members.
Willingness to work outside of one's comfort zone effectively.
Demonstrated success in taking initiative, meeting deadlines, adapting to changing priorities, and managing multiple projects simultaneously.
Familiarity and experience with disk storage hardware (construction and deployment) is preferred: SAN, NAS, JBOD, RAID (all levels), and RAID controllers.
Experience with any of the following storage filesystems is a plus: IBM Spectrum Scale, OpenAFS, AuriStor, NFS, Samba, ZFS.
Elasticsearch experience is a plus.
Strong working knowledge in the following system administration areas:
Advanced knowledge of Unix/Linux operating systems, especially Red Hat-based systems.
Advanced knowledge of compute, storage, and networking related hardware.
Experience with large-scale distributed systems.
Proficiency in programming in Bash and, optionally, Python.
Configuring, building, and installing Linux kernels and modules.
Experience with benchmarking software: Bonnie++, NetPerf, etc.
Knowledge of and experience with version control tools (preferably Git).
Knowledge of and experience with configuration management tools: CFEngine (preferred), Salt, Chef, or Puppet.
Commitment to Equity, Diversity, and Inclusion
At Vanderbilt University, we are intentional about and assume accountability for fostering advancement and respect for equity, diversity, and inclusion for all students, faculty, and staff. Our commitment to diversity makes us who we are. We have created a community that celebrates differences and lets individuality thrive. As part of this commitment, we actively value diversity in our workplace and learning environments as we seek to take advantage of the rich backgrounds and abilities of everyone. The diverse voices of Vanderbilt represent an invaluable resource for the University in its efforts to fulfill its mission and strive to be an example of excellence in higher education.
Vanderbilt University is an equal opportunity, affirmative action employer. Women, minorities, people with disabilities and protected veterans are encouraged to apply.
Please note that all candidates selected for an offer of employment are subject to pre-employment background checks. Depending on the role for which they have been selected, these may include, but are not limited to: criminal history, education verification, social media review, motor vehicle records, credit history, and professional license verification.
Internal Number: 10000583
About Vanderbilt University
Vanderbilt University is a center for scholarly research, informed and creative teaching, and service to the community and society at large. Vanderbilt will uphold the highest standards and be a leader in the quest for new knowledge through scholarship, the dissemination of knowledge through teaching and outreach, and the creative experimentation of ideas and concepts. In pursuit of these goals, Vanderbilt values most highly intellectual freedom that supports open inquiry, equality, compassion, and excellence in all endeavors.