The idea that data is growing exponentially seems to be a universally accepted truth among IT pros. Even so, the vast majority of conversations about data growth seem to center on structured data. However, unstructured data (file data) is also growing at an unprecedented rate. As such, the techniques that have long been used to store and organize unstructured data are quickly becoming inadequate.
While I seriously doubt that many people would deny the idea that unstructured data is growing, it’s easy to underestimate the magnitude of that growth. Out of curiosity, I took a look at the contents of my own file repository and found that it contains roughly 360,000 files. Although that might not seem like a huge number, especially by enterprise standards, I am only one person. Furthermore, that number doesn’t include operating system files, nor does it include my data archives. If I can create 360,000 files (consisting mostly of documents, screen captures, and videos) by myself, just imagine how many files could be created in a large organization.
There are of course any number of measures that administrators have put into place to control the accumulation of files. The Windows File Server Resource Manager, for example, can be used to classify files and to keep users from saving various file types to a network file share. For instance, an administrator might choose to block audio or video files. Similarly, many organizations leverage mechanisms such as user quotas or data lifecycle management policies to prevent excessive numbers of files from accumulating on the network.
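As a simple illustration of the file-screening idea (this is a hedged sketch in Python, not the actual File Server Resource Manager API, and the blocked-extension list is a hypothetical policy), a screen amounts to checking a file's type against an administrator-defined blocklist before allowing the save:

```python
# Illustrative sketch of extension-based file screening (not the real
# FSRM API); the blocked list below is a hypothetical admin policy.
from pathlib import Path

# Hypothetical policy: audio/video types blocked on a network share
BLOCKED_EXTENSIONS = {".mp3", ".wav", ".mp4", ".avi"}

def is_save_allowed(filename: str) -> bool:
    """Return False if the file's extension is on the blocked list."""
    return Path(filename).suffix.lower() not in BLOCKED_EXTENSIONS

print(is_save_allowed("budget.xlsx"))  # an ordinary document type
print(is_save_allowed("movie.MP4"))    # a blocked video type
```

The case-insensitive comparison matters in practice, since Windows file systems treat `movie.MP4` and `movie.mp4` as the same type.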
On one hand, I completely understand why these and other similar techniques are used. There is a direct cost associated with data storage, and keeping data growth in check also helps to control costs. At the same time, though, organizations are increasingly discovering that their seemingly mundane data contains previously hidden business value that is just waiting to be unlocked. If data is potentially valuable to the business then it makes little sense to place stringent restrictions on users’ ability to save files or to force the removal of aging data.
Of course, the unconstrained growth of unstructured data presents other problems beyond storage costs. The biggest challenge might be that of keeping the data organized. Organizations must consider how best to help users locate the one file that they need when it is stored among millions of other files within the file system.
Folder-based taxonomy as a file system solution?
Traditionally, the best way to keep data somewhat organized has been to build a taxonomy into the directory structure. In my own organization, for instance, I use top-level folder names that generally describe the folder’s contents. These names include things like Articles or Business. From there, I create a series of subfolders that help to better organize the information. For instance, I have a Books folder for the books that I have written. I further organize the information within the Books folder by creating subfolders for the publisher, year, and book title.
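The Books portion of that taxonomy can be sketched in a few lines of Python (the root, publisher, and title values below are hypothetical, used only to show the convention):

```python
# A minimal sketch of the Books/publisher/year/title folder taxonomy
# described above; all concrete values are hypothetical examples.
from pathlib import Path

def book_folder(root: str, publisher: str, year: int, title: str) -> Path:
    """Build the nested path for a book under the Books top-level folder."""
    return Path(root) / "Books" / publisher / str(year) / title

path = book_folder("FileServer", "TechPress", 2018, "Storage Basics")
print(path.as_posix())  # FileServer/Books/TechPress/2018/Storage Basics
```

The convention only works, of course, if everyone who saves a file follows it, which is exactly the scaling problem discussed next.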
The folder-based taxonomy that I have created in my own organization works (most of the time), but it probably wouldn’t scale all that well. If there were other users on my network, there is no guarantee that those users would save their files in the correct location based on the established folder structure.
The larger problem with using a folder-based taxonomy is that no matter how organized the folder structure might be, some things are going to be difficult to locate. A few days ago, for example, I needed to find the schematic diagram for a thermal imaging camera that I built several years ago. I didn’t have a top-level folder for the device, and I couldn’t remember exactly when I built the device. I ended up having to resort to using the Windows search interface.
Thankfully, my file server’s contents are fully indexed, but it still took a lot of time to sift through the search results. Using the word “camera” as a search term returned results from articles on apps that take advantage of smartphone cameras, a receipt for a DSLR camera that I bought a few years back, and countless other documents that casually included the word camera.
This type of search is problematic because the results are based on the filename and/or the file’s contents (such as the words in a document file). The search process would likely be far more efficient if you could instead search against descriptive keywords that had been deliberately attached to each file.
What about object storage?
Public clouds such as Amazon Web Services and Microsoft Azure solve this problem by using object storage instead of block storage. Object storage is a flat (as opposed to hierarchical) storage architecture and is designed for massive scalability. Rather than organizing files into folders, objects can be tagged with metadata, making them self-describing and easier to find.
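The flat, tag-driven model can be sketched with a small in-memory object store (all class and object names below are hypothetical; real services such as Amazon S3 or Azure Blob Storage expose similar concepts through their own APIs):

```python
# A minimal in-memory sketch of flat, metadata-tagged object storage.
# All names here are hypothetical; this is not any real cloud API.

class ObjectStore:
    def __init__(self):
        self._objects = {}  # key -> (data, metadata): one flat namespace

    def put(self, key, data, **metadata):
        # No folders: every object lives at the top level, described by tags
        self._objects[key] = (data, metadata)

    def find(self, **criteria):
        # Return keys whose metadata matches every supplied tag
        return [k for k, (_, meta) in self._objects.items()
                if all(meta.get(t) == v for t, v in criteria.items())]

store = ObjectStore()
store.put("schematic-001", b"...", project="thermal-camera", doc_type="schematic")
store.put("receipt-2019", b"...", doc_type="receipt", year=2019)
print(store.find(project="thermal-camera"))  # ['schematic-001']
```

A tag query like this would have found my camera schematic directly, without wading through every document that merely mentions the word camera.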
The use of metadata has its advantages, but it isn’t perfect. SharePoint, for example, has for many years had the ability to apply metadata tags to documents within a document library. There are at least two issues with the use of metadata tagging.
First, there will inevitably be some users who leave tags empty. While you can require tagging, there may be users who enter garbage into the metadata fields in an effort to bypass what they consider to be an annoying or completely useless requirement.
The bigger issue is that the tagging structure does not apply equally well to all documents. Consider, for example, the articles that I write. I might apply tags such as Publication Date, Editor, or Subject. As previously mentioned, though, articles are not the only type of data on my file system. Imagine that I needed to save a copy of a receipt for tax purposes. It probably wouldn’t make much sense to apply the Publication Date or Editor tag to the receipt.
I suspect that the file system of the future will use a combination of metadata tagging and grouping. Similar types of content could be grouped, and then a tag-based taxonomy could be applied to each group, with the tags being specifically designed to match the group’s purpose. I’m not talking about creating a separate bucket for each file type or organizing SharePoint files into separate document libraries. I’m talking about a classification system that is built into the file system, and that can be applied in a manner that is similar to an access control list.
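The group-plus-tags idea can be sketched as follows (a hedged, purely illustrative design; the group names and tag schemas are hypothetical, and a real file system would enforce this natively rather than in application code):

```python
# Hypothetical sketch: each content group defines its own tag schema,
# and a file's tags are validated against the schema of its group.

# Hypothetical per-group tag schemas
SCHEMAS = {
    "articles": {"publication_date", "editor", "subject"},
    "receipts": {"vendor", "amount", "tax_year"},
}

def classify(group: str, tags: dict) -> dict:
    """Accept only the tags defined for the file's group, flagging gaps."""
    schema = SCHEMAS[group]
    unknown = set(tags) - schema
    if unknown:
        raise ValueError(f"tags {unknown} not valid for group '{group}'")
    missing = schema - set(tags)
    return {"group": group, "tags": tags, "missing": sorted(missing)}

record = classify("receipts", {"vendor": "CameraShop", "tax_year": 2019})
print(record["missing"])  # ['amount'] -- tags still to be filled in
```

Because a receipt is never asked for an Editor tag, the schema mismatch problem described above goes away, and the "missing" list gives administrators a concrete way to chase incomplete tagging.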
Next-gen file systems: Archiving and accessing
Although my concept of next-generation file systems is completely theoretical, it addresses some of the major challenges involved in keeping data organized. Regardless of what tomorrow’s file system ultimately looks like, though, it will almost certainly need to include data reduction technologies such as deduplication (already available for NTFS volumes in Windows Server) and a transparent archiving feature. Such a feature would allow aging, seldom-accessed data to be transparently moved to an archival system while continuing to allow users to access the data in the normal way should the need arise. Again, there are products that can do this today, but I am talking about embedding such capabilities into the file system.
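Transparent archiving of this kind can be sketched as a toy two-tier store (entirely hypothetical names and a deliberately simplistic last-access policy; a real implementation would live inside the file system driver):

```python
# Toy sketch of transparent archiving: files idle past a policy window
# move to an archive tier, but a normal read silently recalls them.
# All class and method names are hypothetical.
import time

class TieredFS:
    def __init__(self, max_idle_seconds):
        self.max_idle = max_idle_seconds
        self.primary = {}   # path -> (data, last_access_time)
        self.archive = {}   # path -> data

    def write(self, path, data):
        self.primary[path] = (data, time.time())

    def archive_idle(self):
        # Move files not accessed within the policy window to the archive
        now = time.time()
        for path, (data, last) in list(self.primary.items()):
            if now - last > self.max_idle:
                self.archive[path] = data
                del self.primary[path]

    def read(self, path):
        # Transparent recall: archived files are restored on access,
        # so the user never needs to know which tier the data was on
        if path not in self.primary and path in self.archive:
            self.primary[path] = (self.archive.pop(path), time.time())
        data, _ = self.primary[path]
        self.primary[path] = (data, time.time())
        return data
```

The point of the sketch is the `read` path: recall happens inside the ordinary access operation, which is what makes the archiving transparent to users.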