Unstructured data is proliferating massively. It is growing in volume by more than 50% a year and, according to IDC, it will form 80% of all data by 2025 and already does so for some organizations.
That means unstructured data is a potential storage headache, but it’s also a valuable source of intelligence.
There’s another 80% figure often quoted about unstructured data, which is that four-fifths of all business-relevant information originates in unstructured form, mostly text.
That information sits in emails, reports, articles, customer reviews, customer notes, and other forms of unstructured text. It is also found in social media posts, medical research findings, videos, voice recordings, and monitoring data from remote Internet of Things systems. In short, unstructured data is highly varied, and individual items can range in size from a few bytes to very large files.
Therefore, whether or not the 80% figures are accurate, they highlight the importance of unstructured data.
In this article, we’ll look at the wide variety of unstructured data, the structures that exist in unstructured data, NAS and object storage, and cloud services that target unstructured data.
There is no one size fits all in terms of storage
In terms of size and format, unstructured data can comprise everything from monitoring data from remote Internet of Things (IoT) systems to video. That encompasses file sizes ranging from a few bytes to several gigabytes or more. In between, there is a lot of text-based data that is derived from emails, reports, customer interaction, etc.
To define it, unstructured data is data that is not kept in the structured format we associate with a traditional relational database. Instead, it could reside in any form between raw data and some kind of NoSQL database, a term that covers a range of products and methods for storing data that go beyond the traditional SQL way of doing things.
The type of storage required depends on two things: the capacity you need and the I/O demands your organization will place on it. We are not talking here about the database in use, but about the storage it sits on.
Therefore, unstructured data storage could be anything from relatively low-capacity, low-I/O options, such as a NAS box, an object storage appliance, or a cloud instance, to very large distributed file storage or high-performance object storage.
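To put the capacity side of that decision in numbers, here is a minimal sketch that compounds the growth rate of more than 50% a year quoted at the top of this article. The function name and the 100TB starting point are illustrative assumptions, not figures from any supplier or analyst.

```python
from __future__ import annotations


def project_capacity(current_tb: float, annual_growth: float, years: int) -> list[float]:
    """Return the projected capacity in TB for year 0 through `years`.

    Growth is compounded once a year, which is a simplifying assumption.
    """
    return [current_tb * (1 + annual_growth) ** year for year in range(years + 1)]


if __name__ == "__main__":
    # Hypothetical estate: 100TB today, growing at 50% a year.
    for year, capacity in enumerate(project_capacity(100.0, 0.5, 5)):
        print(f"year {year}: {capacity:,.1f} TB")
```

At that rate, capacity more than doubles every two years, which is why the ability to scale matters as much as the starting size.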
Not as unstructured as you might think
“Unstructured” may be a misnomer. In fact, you could view existing unstructured data on a continuum. At one extreme would be things like IoT data, emails, documents, and possibly some less obvious candidates like voice and video that have metadata headers or come with formats (XML, JSON) that allow for basic analysis. These are semi-structured data.
At the other extreme, there would be a large amount of text sourced from websites or social media posts that would be the most difficult to analyze and process.
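The semi-structured end of that continuum is the easier one to work with. As a rough illustration, the sketch below parses a few IoT readings held as JSON and averages them per sensor. The field names (`sensor`, `ts`, `temp_c`) and the sample values are invented for the example.

```python
import json
from statistics import mean

# Hypothetical IoT readings, one JSON document per line.
RAW_READINGS = """\
{"sensor": "pump-1", "ts": "2024-01-01T00:00:00Z", "temp_c": 40.0}
{"sensor": "pump-1", "ts": "2024-01-01T00:01:00Z", "temp_c": 42.0}
{"sensor": "pump-2", "ts": "2024-01-01T00:00:00Z", "temp_c": 35.0}
"""


def average_by_sensor(raw: str) -> dict:
    """Group readings by sensor name and return the mean temperature for each."""
    grouped = {}
    for line in raw.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        grouped.setdefault(record["sensor"], []).append(record["temp_c"])
    return {sensor: mean(values) for sensor, values in grouped.items()}


if __name__ == "__main__":
    print(average_by_sensor(RAW_READINGS))
```

Free text from websites or social media offers no such field names to key on, which is what makes it harder to process.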
It is beyond the scope of this article to go into detail about data lakes, warehouses, marts, swamps, etc., and the methods for organizing data within them, such as NoSQL.
The key point from the previous section remains: the choice of back-end storage will depend on the capacity and access times required, the I/O profile, and the availability and scalability needed.
NAS is not what it used to be. NAS used to mean a single filer, and with that came the potential for it to become a silo. Scale-out NAS has brought file access storage into the realm of very high capacity and performance.
Scale-out NAS is built with a parallel file system that provides a single namespace across multiple NAS boxes with the ability to scale to billions of files. Capacity can be added and, in some cases, processing power as well.
Scale-out NAS has the advantage of being POSIX-compatible, so it works well with traditional applications and benefits from features such as file locking, which can be important when several users or processes access the same file.
Scale-out NAS was also, until recently, the only option for high-performance unstructured data storage, although object storage is catching up.
On-premises scale-out NAS storage is available from the top five storage array manufacturers: Dell EMC, NetApp, Hitachi, HPE, and IBM. They also have ways to tier data to the cloud, and in some cases offer cloud instances of their NAS products.
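To show what file locking gives an application, here is a minimal sketch that uses Python's standard `fcntl.lockf` call to take an exclusive advisory lock before updating a shared counter file. It assumes a Unix-like system, and the `locked_increment` helper is a made-up example rather than part of any NAS product. How locks behave over a network file system depends on the protocol and the supplier's implementation.

```python
import fcntl
import os


def locked_increment(path: str) -> int:
    """Increment the integer stored in `path` while holding an exclusive lock."""
    # Open read/write, creating the file if it is missing, without truncating it.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as handle:
        # Blocks until no other process holds a conflicting lock on the file.
        fcntl.lockf(handle, fcntl.LOCK_EX)
        try:
            text = handle.read().strip()
            value = int(text) + 1 if text else 1
            handle.seek(0)
            handle.truncate()
            handle.write(str(value))
            handle.flush()
            return value
        finally:
            fcntl.lockf(handle, fcntl.LOCK_UN)
```

Two processes calling `locked_increment` on the same path take turns rather than overwriting each other's update, provided both use the lock, because advisory locks only protect against cooperating processes.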
The big three cloud providers, AWS, Azure, and Google Cloud, provide file storage ranging from standard to premium service tiers, often based on NetApp storage.
There is also a new generation of file storage products designed for use in the hybrid cloud. These include Qumulo, WekaIO, Nexenta, and Hedvig. Elastifile was among these, but was bought by Google in 2019.
Object storage is a newer contender for the crown of unstructured data storage. It keeps data in a flat structure in which each object is accessed through a unique ID, with metadata headers that allow for search and some analysis.
Object storage gained traction as a way around some of the drawbacks of scale-out NAS, which can suffer performance impacts as it grows because of its hierarchical structure.
Object storage is also arguably the native format of the cloud. It’s hugely scalable and accessible via application programming interfaces (APIs), which fits in well with the DevOps way of doing things.
Compared to file storage, object storage lacks file locking and was, until recently, lagging behind in terms of performance, although that is changing, driven by the need for rapid analysis of unstructured data.
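As a toy illustration of that model, the sketch below keeps objects in a single flat dictionary keyed by a generated ID and lets you search on metadata. The `ToyObjectStore` class is entirely hypothetical and in-memory, and does not mirror any supplier's API.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes
    metadata: dict = field(default_factory=dict)


class ToyObjectStore:
    """Flat namespace: no directories, just unique IDs mapped to objects."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes, **metadata) -> str:
        """Store `data` with its metadata and return the new object's ID."""
        object_id = uuid.uuid4().hex
        self._objects[object_id] = StoredObject(data, dict(metadata))
        return object_id

    def get(self, object_id: str) -> bytes:
        """Fetch an object's data by its ID."""
        return self._objects[object_id].data

    def search(self, **criteria) -> list:
        """Return the IDs of objects whose metadata matches every criterion."""
        return [
            object_id
            for object_id, obj in self._objects.items()
            if all(obj.metadata.get(key) == value for key, value in criteria.items())
        ]


if __name__ == "__main__":
    store = ToyObjectStore()
    video_id = store.put(b"...", content_type="video/mp4", project="demo")
    store.put(b"hello", content_type="text/plain", project="demo")
    print(store.search(content_type="video/mp4") == [video_id])
```

There is no path to walk here: a lookup is one step from ID to object, however many objects the store holds.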
The big five make object storage for on-premises use, with ways of tiering object data to the cloud. In addition, there are object storage specialists such as Scality, Cloudian, Quantum, and Pure Storage, as well as the open source Ceph.
All of the basic storage offerings from the major cloud providers are based on object storage, offering different classes of service and performance. AWS, for example, offers different classes of S3 storage that vary based on access time requirements and the value or reproducibility of the data.
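The kind of decision those classes imply can be sketched as a simple rule. The class names below are S3 storage class identifiers, but the function, its parameters, and the once-a-month threshold are illustrative assumptions. The sketch ignores pricing and is not AWS guidance.

```python
def suggest_storage_class(
    reads_per_month: float,
    reproducible: bool,
    retrieval_can_wait: bool,
) -> str:
    """Pick a storage class from access frequency and how replaceable the data is.

    The once-a-month threshold is a hypothetical cut-off for this example.
    """
    if retrieval_can_wait:
        # Archive tier: lowest priority on access time.
        return "GLACIER"
    if reads_per_month >= 1:
        # Regularly read data stays in the general-purpose class.
        return "STANDARD"
    # Rarely read data: a single-zone class is a candidate only if the data
    # could be recreated after a loss.
    return "ONEZONE_IA" if reproducible else "STANDARD_IA"


if __name__ == "__main__":
    print(suggest_storage_class(30, reproducible=False, retrieval_can_wait=False))
    print(suggest_storage_class(0.1, reproducible=True, retrieval_can_wait=False))
```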
Containers and benefits of the cloud
The big three cloud providers offer their core object storage services for use as data lake storage.
Microsoft offers a specific service that will handle unstructured data, Azure Data Lake.
The benefits here are that the cloud provider offers expandable capacity and the means to get data in, through gateways and so on. The downside, of course, is that you have to pay for it, and the more data you put into the data lake, the more it costs.
Additionally, the hyperscalers offer NoSQL databases in their clouds. These can be their own (Google Datastore, Amazon DynamoDB, Azure Cosmos DB) or third-party NoSQL databases that can be deployed in their clouds.