knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Project Badges:
Re: Unstable Server Environment

It has been a while since we posted an update on this issue.

We have continued to engage the GPFS support team, and they have been digging through trace files captured while one of the incidents was occurring. They have found two points of interest that point toward a final resolution of this issue.

First a little background:

We use the IBM GPFS software so that we can mount SAN storage on multiple servers at once. This allows us to scale horizontally (meaning that as more volunteers participate and we grow, we can add more servers to keep up with the load). We have two major directory locations that we use. Your computers connect to the first one to download input files. The second is where your computers upload completed result files. Both the download and upload directories have 1024 sub-directories each in order to balance load and prevent some inefficiencies that occur with large directories.
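To make the fanout concrete, here is a minimal sketch of how a file name might be mapped to one of the 1024 sub-directories. The hash scheme (MD5 of the name, modulo 1024) and the function name are illustrative assumptions; the post does not describe the actual mapping the grid servers use.

```python
import hashlib

FANOUT = 1024  # number of sub-directories, as described above

def subdir_for(filename: str) -> str:
    """Map a file name to one of 1024 sub-directories.

    Hypothetical scheme: hash the name and take it modulo the fanout,
    so files spread evenly and no single directory grows too large.
    """
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return f"{int(digest, 16) % FANOUT:04d}"

# e.g. an input file would live under download/<bucket>/<name>
print(subdir_for("workunit_12345_input.dat"))
```

Any deterministic, well-distributed hash works here; the point is simply that each directory ends up with roughly 1/1024 of the files.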

What they found:

1) The GPFS filesystem is implemented so that when storage is allocated to record the list of files in a directory, that storage can never be reduced. Last year we accidentally created a very large number of files in some of the download sub-directories. Because directory storage blocks are never shrunk, those directories were left with many extra blocks in use. GPFS caches these directory blocks in memory to optimize performance, and in some cases a single directory was using up to 14 MB of storage for its directory blocks (and thus 14 MB of RAM). About half of the download sub-directories had this issue, even though our testing confirmed that these sub-directories only actually require about 1 MB of storage each. Since 512 × 14 MB is a little over 7 GB, the cache we have assigned (2 GB) must evict data frequently to bring in different directory blocks. There are two fixes for this:

a) Add more RAM
b) Recreate the directories
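The cache shortfall above can be checked with the numbers from the post:

```python
# Back-of-envelope arithmetic from the figures quoted above: about half
# of the 1024 download sub-directories (~512) held ~14 MB of directory
# blocks each, though ~1 MB each would suffice.
bloated_dirs = 512
bloated_mb = 14      # MB of directory blocks per oversized directory
needed_mb = 1        # MB actually required after recreation
cache_mb = 2048      # GPFS directory-block cache (2 GB)

total_bloated = bloated_dirs * bloated_mb   # 7168 MB, a little over 7 GB
total_needed = bloated_dirs * needed_mb     # 512 MB, fits in the cache
print(total_bloated, total_needed, total_bloated > cache_mb)
```

With 7 GB of directory blocks competing for 2 GB of cache, constant eviction is inevitable; after recreation the whole working set fits with room to spare.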

We have already created and tested a script to recreate the directories. We ran it on 24 directories, it worked as expected, and we saw a significant reduction in the directory blocks used. Next week we plan to run it on all of the oversized directories. We expect this on its own to mitigate the issue. However, we are also evaluating the addition of more RAM to see whether that would provide additional value, room for growth, and further stability.
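The recreation script itself was not published; the following is a hypothetical sketch of the basic approach. A freshly created directory allocates only the directory blocks its current contents need, so moving the files into a new directory and renaming it over the old one discards the bloat.

```python
import os
import tempfile

def recreate_directory(path: str) -> None:
    """Rebuild a directory so its on-disk directory blocks shrink.

    Illustrative sketch only (the actual grid script is not described
    in the post): move every entry into a freshly created sibling
    directory, remove the old oversized one, then rename the fresh
    directory into its place.
    """
    parent = os.path.dirname(os.path.abspath(path))
    fresh = tempfile.mkdtemp(dir=parent)        # new dir: minimal blocks
    for name in os.listdir(path):
        os.rename(os.path.join(path, name), os.path.join(fresh, name))
    os.rmdir(path)                              # old, oversized directory
    os.rename(fresh, path)                      # swap the fresh one in
```

A production version would also need to preserve permissions and guard against writers touching the directory mid-swap, which is presumably why the window between the moves was kept short.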

2) In the traces, the GPFS support team also saw that latencies on read and write requests are normally about 1 ms on average. During these incidents, however, latencies rose to 10 ms, which severely impacted the cache's ability to write dirty blocks to disk and free up space to load new directory blocks. The team is now looking into what is causing this. We now have a way to reproduce the latencies without causing an incident, so we are hopeful that we can get to the bottom of this issue without further disruption to our end users.
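For a sense of what measuring those latencies involves, here is a minimal sketch of timing small synchronous writes. The function name and parameters are assumptions for illustration; the GPFS team's actual tracing tooling is not described in the post.

```python
import os
import statistics
import time

def measure_write_latency(path: str, samples: int = 100, size: int = 4096) -> float:
    """Time small synchronous writes; return the mean latency in ms.

    Illustrative only: O_SYNC forces each write to reach stable
    storage, so the timing reflects storage latency rather than the
    OS page cache.
    """
    latencies = []
    payload = b"\0" * size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, payload)
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
    return statistics.mean(latencies)
```

A jump from ~1 ms to ~10 ms per operation in a probe like this would cut the achievable I/O rate by roughly 10x, which matches the cache back-pressure described above.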


We believe that both 1 and 2 had to occur together to cause the filesystem lock-up that we have been experiencing. We are quite confident that once we run the scripts to recreate the directories, the issue will be resolved. However, we are also going to resolve the read/write latency issue and examine adding RAM so that we can ensure that this issue is firmly behind us.

THANK YOU for your ongoing patience and support while we have worked through this problem.
[Aug 24, 2012 8:58:56 PM]
knreed
Re: Unstable Server Environment

We started running the scripts to rebuild the directories as described in 1b) above. We have finished running them against the significantly oversized directories, and we have seen a significant improvement in stability since this finished. In fact, we have now gone 36 hours without the issue re-occurring, which is the longest period we have gone without seeing it since early June. As a result, we have added redundancy back into the environment and we are watching it closely to ensure that it remains stable. We are very optimistic at this point.

We are now running the script against the directories that are only modestly oversized (a factor of 2-3 larger than required). This will save an additional 700 MB or so of the cache space needed to hold these directory structures, improving performance and further reducing the chance of a repeat.
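The quoted ~700 MB figure is roughly consistent with the remaining sub-directories each shrinking to the ~1 MB they need. The directory count and the average size below are assumptions chosen for illustration, not figures from the post:

```python
# Rough sanity check of the "700MB or so" saving. Assuming the other
# ~512 download sub-directories are the "modestly oversized" ones,
# averaging ~2.4 MB (within the quoted 2-3x factor) against ~1 MB needed:
dirs = 512             # assumed count of modestly oversized directories
avg_current_mb = 2.4   # assumed average size (2-3x the ~1 MB required)
needed_mb = 1.0
saving = dirs * (avg_current_mb - needed_mb)
print(round(saving))
```

The result lands near 700 MB, so the quoted saving is plausible under these assumptions.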

Additionally, we are moving to install additional RAM in these servers. This will allow us to increase the cache size, further improving GPFS performance and providing additional insurance against this issue recurring. We hope to install the memory by the end of the week and will be able to complete this without impacting users.
[Aug 28, 2012 4:19:23 PM]
knreed
Re: Unstable Server Environment

We have finished running the script to rebuild the directories and have also restored the redundant configuration for the servers in the environment. We have not seen a recurrence of this issue since Sunday, so we are pleased to see that the steps we have taken are addressing it.

Today, starting in a short while, we will be increasing the RAM on the servers and giving GPFS more cache space to work with. While not strictly required to resolve this issue, our investigation during the incidents revealed that the cache was barely sufficient for the performance we need. Adding this additional RAM is prudent to ensure high performance as we continue to grow.
[Aug 30, 2012 12:37:46 PM]
knreed
Re: Unstable Server Environment

We have completed the work to add more RAM to the servers and give the GPFS filesystem more cache space to work with. This, in addition to the directory size changes, has resolved the issue.

We continue to investigate the SAN performance and latency issues that were detected. However, as we had hoped, resolving one of the two identified issues allowed the system to resume normal operations, so we are marking this issue as resolved.
[Sep 5, 2012 5:33:28 AM]