Dark data is a major issue for many companies, yet not all of them are aware of it. In the course of their activities, companies process and store large amounts of data on which they later base their business decisions. But according to a recent study by IBM, organizations use only 12% of the data that they collect, while the rest stays in the shadows. This unused information can cause serious problems for companies and even lead to a security breach. However, if managed properly, it can become a useful aid for business decisions or for optimizing operations.
Now: what is dark data?
According to Gartner, dark data is the “information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes” (for example, in analytics or for direct monetizing). In other words, it is hidden from view and hard to access or analyze properly, because the organization may not even know that the information exists and is being collected.
Dark data can include:
- Customer information;
- Log files;
- Financial statements and outdated versions of documents;
- Notes and presentations;
- Emails and email attachments;
- Call-center transcripts and customer reviews.
Now that the definition of dark data is clear, we can look at its main types:
- Untapped Internal Data: information that organizations collect, store, and process, but use only once, for the single purpose it was collected for.
- Non-traditional Unstructured Data: this type of information is usually attached and/or related to audio, video, and image files. Thus, it cannot be properly explored without special technologies, such as computer vision, advanced pattern recognition, or video and sound analytics.
- Deep Web Data: this information is often hidden behind firewalls and requires specialized tools or techniques to collect and analyze.
Reasons why dark data occurs
Information can be unusable for different reasons, such as its location, its sheer quantity, or a lack of the resources needed to collect and/or analyze it. It also often happens that companies generate and collect much more information than they are able to process. Let’s look at these reasons in a bit more detail.
Lack of access
A lot of dark data consists of information that is no longer accessible. People continuously store data on their private and company devices (e.g. USB sticks, mobile devices, or portable hard drives). So when such a device is lost or the login credentials are forgotten, access to the data is blocked and the information may be lost forever.
Another example is a proprietary file format that requires a specific program to read it. It could happen that this program can no longer be used or is no longer available in the required version. Hence, the information remains trapped.
Lack of processing resources
Companies collect enormous amounts of data, but they don’t analyze all of it because they lack the necessary processing tools. Besides, some information comes in formats that require specific tools for analysis, such as image files and spoken text in audio files. Or there might simply be a lack of knowledge and skills on how to integrate the existing information and derive valuable insights from it.
To resolve this issue, companies need to deploy suitable analysis tools. In addition, they will need people with significant data science expertise, who, in turn, might be difficult to find.
Lack of data governance
It often happens that different departments within an organization have their own data collection and storage processes, which may not be shared with other departments. So the information collected by one department will be unseen and unused by others even if it is relevant and can be valuable to an organization as a whole. When you don’t have proper data governance, there’s a higher chance of your organization operating in silos, which can lead to serious issues and inaccurate business decisions.
The problems that dark data might cause
Although dark data holds potential value for organizations, it also causes problems if it is not handled promptly and effectively. Here are the main concerns to keep in mind.
As the amount of collected information grows, it requires more storage space. And if you don’t use your existing information, it turns into junk that takes up valuable space and eats up your resources. As the required storage space keeps growing, the storage costs increase as well, so companies need to prioritize which information to utilize and which to push aside.
Hidden valuable information
Companies often don’t even know what type of information they’re storing. However, dark data may contain the very insights that an organization has been looking for. Leaving it unused means lost opportunities in business decisions, the lack of a holistic view of processes and, as a result, lost revenue and resources.
You can use dormant business data to mine essential insights and patterns in internal processes and user behavior for continuous improvement. Hidden information can be crucial for business insights, as companies could lose to competitors by ignoring it.
The part of your dark data that is not securely hidden is the most vulnerable to leaks and theft. It is very easy for hackers to access systems that use outdated software components. So it’s important to know if any business-critical information is in your dark data warehouses. In order to secure your organization from possible cyber threats, it is essential to know your inventory and properly safeguard it.
Ways to manage dark data
The more information a company produces, collects, and holds across multiple systems, the more important it is to have a solid data management strategy in place. You can start implementing the following practices in the IT department to help you detect dark data and derive value from it.
Use data retention policies
Organizations use data retention tools to create and enforce storage and security policies for prescribed periods. Such policies determine which types of data should be kept and which should be deleted. If the information is meant to be deleted when its retention period expires, the policies also outline how to do so securely.
You must also consider the legal implications of this task. For example, if data covered by a specific mandate or regulation appears anywhere in a dark data collection, its exposure could involve legal and financial liability. So keep local and international privacy requirements in mind.
Data retention policies also encourage organizations to look through their databases and double-check whether there is any important information they didn’t recognize at first. This helps close the gap on missed opportunities.
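To make the idea of a retention policy more concrete, here is a minimal Python sketch. The category names, the retention periods, and the `Record` and `retention_action` names are all hypothetical and chosen only for illustration. Real retention periods depend on the regulations that apply to your organization, and a real tool would also handle secure deletion.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical retention periods per data category, in days.
# These values are assumptions for illustration, not legal guidance.
RETENTION_DAYS = {
    "log_file": 90,
    "call_transcript": 365,
    "financial_statement": 365 * 7,
}


@dataclass
class Record:
    name: str
    category: str
    created: date


def retention_action(record: Record, today: date) -> str:
    """Return 'keep', 'delete', or 'review' for a single record."""
    limit = RETENTION_DAYS.get(record.category)
    if limit is None:
        # No policy covers this category: flag it for manual review
        # instead of silently keeping or deleting it.
        return "review"
    expires = record.created + timedelta(days=limit)
    return "delete" if today >= expires else "keep"


if __name__ == "__main__":
    today = date(2024, 1, 1)
    records = [
        Record("app.log", "log_file", date(2023, 1, 1)),
        Record("q3-report.pdf", "financial_statement", date(2023, 1, 1)),
        Record("old-notes.txt", "notes", date(2020, 1, 1)),
    ]
    for record in records:
        print(record.name, "->", retention_action(record, today))
```

Records in a category without a policy come back as "review" rather than "keep". That choice is deliberate in this sketch: unclassified data is the kind that tends to turn dark, so it is surfaced instead of ignored.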
Frequently audit your data
Finding and classifying unknown data is crucial for organizations’ privacy and compliance initiatives. After all, you can’t protect what you don’t know you have (or what level of protection is needed). So to shed some light on hidden resources, you can use specific tools to help you out.
- DeepDive: a tool developed at Stanford University which extracts data relationships and makes corresponding inferences. DeepDive uses machine learning to convert dark data into a structured format that can be combined with existing data sets.
- Snorkel: a system where a user can create large training sets by writing simple programs that label data. Snorkel can extract value by leveraging labeling functions written by domain experts to generate training sets for machine learning models. It is focused on accelerating the development of structured or “dark” data extraction applications for domains in which large labeled training sets are not available.
- Dark Vision: an application for analyzing video and audio to discover its content without requiring a user to watch or listen to the materials. After processing the file, the app automatically creates a summary that contains a set of tags, personalities, and locations. This summary helps with the categorization of the content without wasting your time watching it.
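To illustrate the labeling-function idea mentioned for Snorkel, here is a small sketch in plain Python. It does not use Snorkel's API. The label constants, the function names, and the keyword rules are hypothetical, and the simple majority vote stands in for the more sophisticated way a real system would combine noisy labels.

```python
from collections import Counter

# Label values used in this sketch (hypothetical, not Snorkel's constants).
ABSTAIN, COMPLAINT, PRAISE = -1, 0, 1


# Each labeling function encodes one rule of thumb from a domain expert.
# It either assigns a label or abstains when the rule does not apply.
def lf_mentions_refund(text: str) -> int:
    return COMPLAINT if "refund" in text.lower() else ABSTAIN


def lf_mentions_broken(text: str) -> int:
    return COMPLAINT if "broken" in text.lower() else ABSTAIN


def lf_says_thanks(text: str) -> int:
    return PRAISE if "thank" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_refund, lf_mentions_broken, lf_says_thanks]


def weak_label(text: str) -> int:
    """Combine the labeling functions by majority vote, ignoring abstentions."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    # A tie between the two most common labels is treated as an abstention.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]


if __name__ == "__main__":
    transcripts = [
        "The device arrived broken, I want a refund",
        "Thank you for the quick help",
        "Where is my order?",
    ]
    for text in transcripts:
        print(weak_label(text), text)
```

Applied to a pile of unlabeled call-center transcripts, rules like these produce a rough training set without manual labeling. The labels are noisy, which is why each function is allowed to abstain.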
An audit should help you not only uncover hidden information but also trace its sources (and manage them, if needed). Also, keep in mind that not all data is useful for the business. Use audits to determine what’s worth retaining and what can be deleted in order to free up storage space.
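A first audit pass over a file share can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the function name `find_stale_files` is hypothetical, and "not modified for a given number of days" is used here as a rough proxy for "unused", which will not fit every kind of data.

```python
import os
import time
from collections import defaultdict
from pathlib import Path
from typing import Dict, Optional, Tuple


def find_stale_files(
    root: str, max_age_days: int, now: Optional[float] = None
) -> Dict[str, Tuple[int, int]]:
    """Summarize files under root that were not modified within max_age_days.

    Returns a mapping {extension: (file_count, total_bytes)}.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    summary = defaultdict(lambda: [0, 0])
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = Path(dirpath) / filename
            try:
                info = path.stat()
            except OSError:
                continue  # unreadable or vanished file: skip it
            if info.st_mtime < cutoff:
                ext = path.suffix.lower() or "(none)"
                summary[ext][0] += 1
                summary[ext][1] += info.st_size
    return {ext: (count, size) for ext, (count, size) in summary.items()}


if __name__ == "__main__":
    # Report files in the current directory untouched for over a year.
    for ext, (count, size) in sorted(find_stale_files(".", 365).items()):
        print(f"{ext}: {count} files, {size} bytes")
```

Grouping the result by file extension gives a quick picture of what kind of data is sitting idle (logs, documents, media) and how much space it takes, which helps decide what to inspect more closely and what to delete.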
Break the silos
It might happen that one team (or department) generates data that could be useful to others, but this data remains isolated within that team and hence results in data silos. We’ve recently written an article on identifying and breaking down silos, so we highly recommend checking it out. In short, exposing and eliminating silos leads to increased transparency and visibility of the data and might contribute to dark data discovery as well.
The amount of information that companies collect and store during regular business activities will only grow as they deploy more advanced tools for data collection and analysis. By managing these processes more efficiently, companies can turn their dark data into an ally that helps them understand the needs of their customers as well as their own efficiencies and shortcomings.
What are examples of dark data?
Examples of dark data differ depending on the company and industry. Usually, such data is outdated, unused, and unstructured, and the organization may not even know it is being collected. For example, dark data can include:
- Log files and survey data;
- Previous employee information;
- Financial statements, geolocation details;
- Surveillance video footage and call-center transcripts;
- Notes, presentations, or old documents;
- Emails and email attachments;
- Inactive databases, etc.