Dark Data Explained: Risks, Costs, Management


In a world where 402.74 million terabytes of data are created daily, there are records and files untouched by the light of business analytics, called dark data. More than 50% of all collected data remains untapped, unlabelled, and unnoticed, leading to issues from wasted storage to missed opportunities.

Fortunately, we know how to find, optimize, and use it. Discover how dark data management can be a valuable aid to business decision-making.


Key takeaways

Over 50% of business data remains unused, resulting in both revenue leakage and growing operational costs.
Common sources of dark data include IoT devices, recording systems, customer interactions, and business operations.
To effectively manage dark data, you need to develop a strong data management strategy, audit your data assets, and use AI tools for data discovery and labelling.

Now: what is dark data?

According to Gartner, dark data is the “information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes” (such as analytics or monetization). In other words, it is hidden from view and hard to access or analyze properly. Organizations may not even know about some of the data they collect and store.

Dark data examples include:

Log files;
IoT sensor data;
Financial statements and outdated versions of documents;
Notes and presentations;
Emails and email attachments;
Call-center transcripts and customer reviews;
Customer information.

Now that the definition is clear, we can look at the two main types of dark data.

Untapped internal data is information that organizations collect, store, and process but use only once, for a single purpose.
Non-traditional unstructured data is usually contained in or related to audio, video, and image files. It cannot be adequately explored without specialized technologies, such as computer vision, advanced pattern recognition, or video and sound analytics.

Dark data is more than “old records” or “bad data”. When discovered, it has the potential to improve business decisions and help organizations to see the bigger picture in terms of market opportunities and revenue streams.

Why dark data accumulates

Information can become unusable for various reasons, such as its location, its sheer volume, or a poor data quality management strategy.

Limited access

A large share of dark data is information that is no longer accessible. People continuously store data on their private and company devices (e.g., USB sticks, mobile devices, or portable hard drives). When such a device is lost or its login credentials are forgotten, access to the data is blocked, and the information may be lost forever.

Another example is a proprietary file format that can be read only by a specific program. If that program can no longer be used or is no longer available in the required version, the data remains trapped.

Lack of processing resources

Companies collect enormous amounts of data but don’t analyze all of it due to a lack of suitable data processing tools. Besides, some records, such as image files and spoken text in audio files, come in formats that require specialized tools. Or there might simply be a lack of the knowledge and skills needed to integrate the existing data into operations and derive valuable insights.

To resolve this issue, organizations need to deploy appropriate analysis tools. In addition, they will need data science experts, who might be difficult to find.

Poor data governance

While 88% of organizations claim to have a data management strategy, 44% of them lack basic measures like archiving or data lifecycle management. Such oversight creates a situation where businesses don’t know what data they have access to, who owns it, how it should be used, or how it should be protected. In such cases, there’s a higher chance that the organization operates in silos, carries noncompliance risks, and is prone to data breaches. All of this can lead to serious consequences, from misguided business decisions to fines.

Shifting business priorities

When business priorities change, some datasets can become less relevant and, consequently, are no longer actively analyzed or applied. As business attention shifts to new initiatives, such data risks becoming dark, even though it remains valuable. The lessons from past strategies remain unlearned, and current decisions are left unguided by historical operations. For example, shifting from brick-and-mortar stores to ecommerce can leave old transactional data ignored.

Regulatory compliance requirements

Businesses accumulate large volumes of data to comply with data security and privacy regulations. Data like password reset logs, payment histories, and system configuration backups is rarely analyzed. Nevertheless, such data can contribute to risk detection and help to unveil trends, serving beyond its regulatory purpose.

Common sources of dark data

Unused data accumulates in business systems during daily operations, especially in the absence of a strong data governance framework.

Sensors and IoT devices

Rising demand for real-time data monitoring and predictive analytics drives the growth of the IoT sensor market and, consequently, the volume of device telemetry it generates. Data from wearables, connected vehicles, and equipment tends to be overlooked. Its volume, variety, and velocity require proportional resources for processing and analytics, and businesses often lack them.

For example, a manufacturing plant uses sensor data on vibration, temperature, and pressure to trigger real-time alerts in case of a malfunction. The data also gets logged and stored on local servers, but is never analyzed. In this case, firstly, it takes up storage without any business advantage. Secondly, its potential for preventive maintenance and for identifying inefficiencies remains in the shadows.

Customer interactions

Customer-facing apps and platforms usually collect various data, from browsing history to chatbot conversations. Whether such interactions result in a purchase or cart abandonment, the data is often captured, stored, and forgotten. Yet such unused data assets contain valuable insights into user behavior and intent, and can make a difference in understanding the purchase journey and content relevance.

Consider an ecommerce store. When a business focuses only on high-level metrics like traffic and conversion rate, records about, say, product pages viewed remain overlooked. The company then fails to notice that customers struggle to find the right product or are confused by unclear descriptions. As a result, the overlooked data indirectly affects the very KPIs the business prioritizes most.

IT systems and platforms

The data from internal systems and platforms comes from servers, applications, databases, network devices, and other sources. It includes application events, infrastructure performance fluctuations, security system records, and other files and logs. It is not only constantly generated but also usually captured and stored by default. Rarely addressed, this data slowly turns into dark data.

For various reasons, such datasets may go unrecognized and unanalyzed. Sometimes they are incompatible with modern analytics tools or are siloed across servers. Other times, such records are perceived as having low value or are just too large to process with limited resources. As time passes, such untapped intelligence increases data storage costs and leads to operational inefficiencies, especially at the enterprise level.

Business operations

From reports and contracts to emails and meeting recordings, daily business processes produce large volumes of data. CRMs, ERPs, HR, and billing systems gather records that might never be used, analyzed, archived, or deleted, so they become an invisible burden. Nevertheless, if their lifecycle is properly managed, dark data from business operations can eliminate business blind spots.

For example, reviewing customer communications helps reveal bottlenecks, such as a confusing return policy, and detect signals of churn, such as a less engaged tone. Logistics is another case. Here, the stop duration metric, which is often collected but rarely analyzed, can help identify consistently long stops, detect avoidable waiting times, and optimize routing.

Recording systems

Recording systems can capture audio, video, or screen activity as part of everyday business activity. Video surveillance systems, call recording systems, and collaboration tools like Zoom or Teams generate data that is commonly reviewed only when incidents occur. You can have an extensive archive of meeting recordings that are never summarized, tagged, or labelled.

Let’s take a look at customer service calls, which organizations usually record for quality assurance, compliance, and legal reasons. Unfortunately, businesses do not always inspect them to detect recurring issues, pain points, and product problems. Yet, such data can provide a comprehensive customer overview and highlight emerging needs.

Problems dark data might cause

Although dark data holds potential value for organizations, it also causes problems if it is not handled in a timely and effective manner. Here are the main concerns to keep in mind.

Storage inefficiencies

As the amount of collected data grows, it requires more storage space. If you don’t use these records, they turn into junk that takes up valuable space and eats up your resources. Companies then need to prioritize which information to use and which to archive or delete. Large datasets also make data management more complex.

A company that provides integrated healthcare services faced risks associated with data sprawl and large volumes of dark data. The latter overburdened data storage resources, resulting in higher expenses. The organization implemented a data lifecycle management system and comprehensive analytics and found that one-third of its storage was filled with files untouched for 5 years, most of which were orphaned. After the optimization, it turned out that dark data discovery and governance led to $7.5 million of yearly savings.

Security exposures

The part of your dark data that is not properly secured is the most vulnerable to leaks and theft. Hackers can easily access systems that run outdated software components. For example, the poorly secured Starwood guest reservation system acquired by Marriott was not properly integrated or actively managed after the acquisition, leaving sensitive guest data vulnerable. The result was a leak of up to 500 million guest records, including 5.25 million unencrypted passport numbers.

Increased storage costs

Imagine your enterprise keeps 300 TB of data in short-term storage (Amazon S3 Standard) and 3.5 PB in long-term storage (Amazon S3 Infrequent Access). The approximate total cost would be $52,000 per month, or roughly $624,000 per year. If around 50% of that is dark data, your organization pays more than $300,000 a year for assets that aren’t used in business. As data volumes keep growing, storage costs increase as well, not to mention rising storage prices.
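The arithmetic behind this estimate can be sketched in a few lines of Python. This is a rough illustration only: it takes the approximate $52,000 monthly bill and the 50% dark data share from the paragraph above as inputs, and it assumes (our simplification, not a measured fact) that dark data is spread evenly across storage tiers, so cost scales with its share of volume.

```python
# Inputs taken from the example above; both are rough estimates.
MONTHLY_STORAGE_COST_USD = 52_000
DARK_DATA_SHARE = 0.50  # assumed share of stored data that is never used


def annual_dark_data_cost(monthly_cost: float, dark_share: float) -> tuple[float, float]:
    """Return (annual storage cost, annual cost attributable to dark data).

    Assumes cost is proportional to volume, i.e. dark data is spread
    evenly across storage tiers.
    """
    annual_cost = monthly_cost * 12
    return annual_cost, annual_cost * dark_share


annual, dark = annual_dark_data_cost(MONTHLY_STORAGE_COST_USD, DARK_DATA_SHARE)
print(f"Annual storage cost: ${annual:,.0f}")
print(f"Spent on dark data:  ${dark:,.0f}")
```

Plugging in your own monthly bill and an audited dark data share gives a quick first estimate of what unused data costs you.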

Compliance risks

Compliance with regulations like GDPR and HIPAA requires that all data they cover (Personally Identifiable Information and Protected Health Information respectively) is secured – even the data you may not know you store. Such data can be kept longer than required, be at risk of exposure, or just be poorly protected. Hidden data hampers the company’s response to audits and regulatory requests and can lead to fines ranging from thousands to millions of dollars. In the aforementioned Marriott data breach, the hotel chain paid $52 million to settle charges for failing to secure sensitive guest data.

Difficulties in AI adoption

The major hurdle to AI adoption is poor data, yet only 46% of respondents are confident in the quality of data within their organization. Untapped datasets contribute to the problem: they are often left unused, untagged, and siloed, which significantly worsens the mess. If the data is not even visible, organizations struggle to decide what to feed an AI model to get relevant results. Even when it is visible, dark data is often unvalidated, incomplete, and inconsistent. If you feed your AI model such low-quality data, you’ll get results of the same quality.

Hidden valuable information

41% of business leaders acknowledge they lack data understanding due to its complexity or limited availability. However, dark data may contain the valuable insights an organization has been looking for. Leaving it unexplored means lost opportunities in business decisions, no holistic view of processes, and, as a result, lost revenue and resources. You can use dormant business data to mine essential intelligence and patterns in internal processes and user behavior for continuous improvement.

Envision Racing, a British motor racing team, succeeded in uncovering dark data potential and improved drivers’ strategy. For many years, the company collected audio transmissions between drivers and their engineers. With the help of natural language processing and deep learning models, the developers cleaned these records of noise and deciphered code names and acronyms, enabling the teams to gain a competitive advantage from the analyzed data and adjust their strategies during the race.

Dark data discovery is not a one-time initiative

When data remains undiscovered or poorly governed, it becomes an operational burden rather than a strategic resource. To prevent this, dark data discovery and management should be treated as a routine operational task rather than a one-time initiative.

Embedding regular audits or automated discovery processes helps reduce operational blind spots, improve decision-making accuracy, and ensure that potentially valuable datasets do not remain invisible and unused. At the same time, maintaining consistent data policies (such as standardized naming conventions, structures, and data rules) makes it significantly easier to integrate previously forgotten or newly discovered datasets into existing systems and analytics workflows.

Volha Hedranovich, Senior Machine Learning Engineer, SoftTeco

Ways to manage dark data

The more information a company produces, collects, and holds across multiple systems, the more important it is to have a solid data management strategy in place. The following practices can help you detect dark data and derive value from it.

Use data retention policies

Organizations use data retention tools to create and enforce storage and security policies for prescribed periods. Such policies determine which types of data should be kept and which should be deleted. If the information is meant to be deleted when its retention period expires, the policies outline specific procedures for doing so securely.

You must also consider the legal implications of this task. For example, if data covered by a specific mandate or regulation, such as GDPR or CCPA, appears anywhere in your dark data, its exposure could result in legal and financial liability. So keep local and international privacy requirements in mind.

Moreover, data retention policies encourage organizations to review their databases and double-check whether there are any essential records they didn’t recognize at first. By implementing such policies, you can close the gap for any missed opportunities.
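To make the idea of a retention policy more concrete, here is a minimal Python sketch. It assumes each record has a category and a creation date. The category names and retention periods in `RETENTION_DAYS` are made-up placeholders for illustration; real periods must come from your own legal and business requirements.

```python
from datetime import date, timedelta

# Hypothetical retention periods in days; placeholders, not recommendations.
RETENTION_DAYS = {
    "payment_history": 7 * 365,
    "support_chat": 2 * 365,
    "web_log": 90,
}


def retention_action(category: str, created: date, today: date) -> str:
    """Decide what to do with a record under the policy table above."""
    limit = RETENTION_DAYS.get(category)
    if limit is None:
        # No policy covers this category: flag it for manual review
        # instead of silently keeping or deleting it.
        return "review"
    if today - created > timedelta(days=limit):
        return "delete"
    return "keep"


print(retention_action("web_log", date(2024, 1, 1), date(2024, 6, 1)))
print(retention_action("crm_export", date(2024, 1, 1), date(2024, 6, 1)))
```

Returning "review" for unknown categories is a deliberate choice in this sketch: records that no policy covers are exactly the ones most likely to be dark data.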

Build strong data culture

One way to manage your data lifecycle and prevent data from going dark is to establish clear policies and an organizational culture for data handling. Ensure that each dataset has a business or technical owner who is responsible for its quality, documentation, and maintenance. Foster a mindset that every piece of data should be collected for a purpose, not “just in case”. Finally, make data discoverable and accessible across teams.

Create a comprehensive data lake platform

Data lakes can handle large volumes of data in various formats, so building one to centralize all your data assets might be a good idea. Such platforms help your organization break down data silos, store files in their original format, and improve data discoverability. In addition, data lakes are optimized for ML and advanced analytics, meaning value extraction from audio, video, and log data becomes faster and easier.

Frequently audit your data

Finding and classifying unknown data is crucial for organizations’ privacy and compliance initiatives. After all, you can’t protect what you don’t know you have (or what level of protection is needed). So, to shed some light on hidden resources, you can use specific tools to help you out.

DeepDive is a tool developed at Stanford University that extracts data relationships and makes corresponding inferences. DeepDive uses machine learning to convert dark data into a structured format that can be combined with existing data sets.
Snorkel is a system where a user can create large training sets by writing simple programs that label data. It is focused on accelerating the development of “dark” data extraction apps for domains in which large labeled training sets are not available.
Dark Vision is an application that analyzes video and audio to discover their content without requiring a user to watch or listen to the materials. After processing, the app automatically creates a summary that contains a set of tags, personalities, and locations.
Collibra is an enterprise data intelligence platform that discovers, organizes, and manages a company’s datasets, reducing compliance risks and increasing transparency. The solution is cloud-based and uses AI and ML to automate governance tasks.
Alation is a data catalog and discovery platform. It automatically scans data lakes, data warehouses, and cloud storage to discover dark data. Alation’s natural language search and ML-powered suggestions help users find relevant data fast.

A data audit should help you not only uncover hidden data but also trace its sources (and manage them, if needed). Additionally, keep in mind that not all data is useful for the business. Use audits to determine what’s worth retaining and what can be deleted to free up storage space.

Break the silos

Separate teams or departments maintain some datasets. Without proper sharing and integration into business systems, they turn into data silos – a home for dark data. The exposure and elimination of silos increase transparency and visibility of data and may contribute to the discovery of dark data.

Use AI and ML tools

AI tools, such as Microsoft Purview, Dataplex Universal Catalog, Palantir Foundry, and Collibra, can help your company find and handle dark data. For example, you can configure them to scan files, databases, cloud storage, emails, logs, and multimedia to find hidden datasets and their connections. After discovery, AI models can analyze and profile dark data, tag sensitive data, and detect metadata patterns. They can also generate summaries for videos and link related content across silos, improving data visibility and accessibility across departments.

How to discover dark data?

Dark data detection includes not only an assessment of all the records your organization possesses, but also their evaluation over the course of business activity.

Conduct a data audit

Start dark data discovery by auditing all your data sources and storage: all repositories, document systems, CRMs, and transactional systems. Define what types of data they collect or store, their owner, and the last access/update date. That will give you a clearer understanding of whether this data has business potential or just takes up space.
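For file-based storage, a first-pass audit can be as simple as the sketch below. It walks a directory tree, records each file’s size and age, and flags files that have not been modified for a given number of days. The function names and the 365-day threshold are our own illustrative choices, and the sketch uses the modification time because last-access times are not reliably recorded on many systems.

```python
import time
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class FileRecord:
    path: str
    size_bytes: int
    days_since_modified: float


def inventory(root: str, now: Optional[float] = None) -> List[FileRecord]:
    """List every regular file under root with its size and age in days."""
    now = time.time() if now is None else now
    records = []
    for p in Path(root).rglob("*"):
        try:
            if not p.is_file():
                continue
            st = p.stat()
        except OSError:
            continue  # unreadable or vanished file: skip it in this sketch
        age_days = (now - st.st_mtime) / 86_400
        records.append(FileRecord(str(p), st.st_size, age_days))
    return records


def stale(records: List[FileRecord], max_age_days: float = 365) -> List[FileRecord]:
    """Return the files not modified within max_age_days."""
    return [r for r in records if r.days_since_modified > max_age_days]


if __name__ == "__main__":
    for rec in stale(inventory(".")):
        print(f"{rec.days_since_modified:8.0f} days  {rec.size_bytes:>12} B  {rec.path}")
```

The resulting list is only a starting point: ownership and business purpose still have to be filled in by people who know the systems.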

Evaluate dark data characteristics

To discover dark data and efficiently manage its occurrence, you can also consider the following qualities.

Data staleness. Determine how long ago the dataset was last updated or modified. This helps you understand how outdated and unused your dark data is.
Usage frequency. This factor shows whether a dataset is a reliable source for business models, systems, and BI solutions.
Data provenance. Review all data pipelines and the ways upstream and downstream apps use it.
Data quality. Check discovered assets for missing data, duplicate values, and incorrect patterns and decide what datasets you need to fix and what to delete.
Redundancy. Examine whether your data has copies, duplicates in multiple systems, or several document versions, and use ML to reduce data clutter.
Classification. Search for unclassified, untagged, or unlabelled data and check whether it is sensitive to avoid penalties for non-compliance.
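Two of the checks above, data quality and redundancy, can be illustrated with a small Python sketch. It assumes a dataset is available as a list of dictionaries and that you know which fields are required. The function name `quality_profile` and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def quality_profile(rows: List[dict], required: List[str]) -> Dict[str, object]:
    """Count missing required values and duplicate rows in a dataset."""
    # A value counts as missing if the key is absent, None, or an empty string.
    missing = {
        field: sum(1 for row in rows if row.get(field) in (None, ""))
        for field in required
    }
    # Two rows are treated as duplicates if all required fields match.
    keys = Counter(tuple(row.get(field) for field in required) for row in rows)
    duplicate_rows = sum(count - 1 for count in keys.values() if count > 1)
    return {"rows": len(rows), "missing": missing, "duplicate_rows": duplicate_rows}


sample = [
    {"email": "a@x.com", "name": "Ann"},
    {"email": "a@x.com", "name": "Ann"},  # exact duplicate
    {"email": "", "name": "Bob"},         # missing email
    {"email": "c@x.com"},                 # missing name
]
print(quality_profile(sample, ["email", "name"]))
```

A profile like this helps decide which datasets are worth fixing and which are cheaper to delete.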

Use automation tools

You don’t have to discover data assets manually. Modern metadata tools like Alation and Apache Atlas not only help find scattered, hidden data but also categorize and efficiently manage it, creating a comprehensive data catalog. They scan your on-premises and cloud repositories, find, and classify the data they reveal. You’ll get a complete picture of enterprise data classified by source, type, lineage, sensitivity, access rights, and more.

Enhance security

When you’ve discovered dark data, check whether any discovered data was overlooked from a security and compliance perspective. Enforce data protection measures, such as encryption, tokenization, or masking, for the affected records. Finally, establish a data governance framework that defines roles and responsibilities for data management, establishes data standards, and defines deletion policies.
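As a small illustration of masking and tokenization, the Python sketch below replaces email addresses in free text with a placeholder and derives a stable token from a sensitive value. Note the assumptions: the regular expression is a simplified email pattern, and the "tokenization" here is keyed hashing (HMAC-SHA256) used as a simple stand-in. Production tokenization usually relies on a dedicated vault or service.

```python
import hashlib
import hmac
import re

# Simplified email pattern for illustration; it will not match every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def pseudonymize(value: str, key: bytes) -> str:
    """Derive a stable 16-character token from a value using a secret key.

    The same value and key always give the same token, so records can
    still be joined without exposing the original value.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


secret_key = b"replace-with-a-real-secret"  # hypothetical key for the demo
print(mask_emails("Customer jane.doe@example.com asked about a refund"))
print(pseudonymize("jane.doe@example.com", secret_key))
```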

Dark data costs

As hard as it is to estimate the potential gains from dark data discovery and analysis, it is easy to see the potential losses dark data implies.

Data storage costs

In 2024, 62% of survey respondents reported cloud storage budget overruns, and this number tends to grow from year to year. Keeping in mind that over half of your datasets are untapped, you can easily calculate how much you overpay for storing records you never use. Furthermore, cloud service providers also charge businesses for moving data out of the cloud (cloud egress), so it’s vital to keep track of all your data in order to cut costs.

Data breaches

Due to its concealed nature, dark data poses security risks, as it tends to be less protected and monitored. While the global average cost of a data breach totalled $4.4 million in 2025, individual companies received fines ranging from a few hundred to a million dollars. Moreover, 86% of businesses reported that data breaches disrupted their operations, further exacerbating losses.

For example, in April 2025, a British law firm, DPP Law Ltd, was fined after hackers stole 32 GB of personal information from its network. The company received a $78,000 penalty for failing to protect electronically held information.

Data ROT

Dark data can overlap with redundant, obsolete, and trivial (ROT) data, meaning it may include missing fields, inaccuracies, and duplicate values. With a decay rate of 2.1% per month for B2B data, roughly a quarter of your records would become outdated by the end of the year. Such poor data can, on average, cost organizations $12.9 million a year, damaging their compliance, market opportunities, and decision-making.
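The yearly figure can be checked with a short Python sketch. It takes the 2.1% monthly decay rate quoted above and compares two readings of it: a simple sum over 12 months and a compounded rate, where each month only the still-valid records can decay. Which reading applies depends on how the rate was measured, which the figure alone does not tell us.

```python
def share_outdated_compound(monthly_decay: float, months: int) -> float:
    """Share of records outdated after `months`, if decay compounds monthly."""
    return 1 - (1 - monthly_decay) ** months


def share_outdated_simple(monthly_decay: float, months: int) -> float:
    """Share of records outdated after `months`, if decay is simply summed."""
    return monthly_decay * months


MONTHLY_DECAY = 0.021  # rate quoted in the paragraph above

print(f"Simple:     {share_outdated_simple(MONTHLY_DECAY, 12):.1%}")
print(f"Compounded: {share_outdated_compound(MONTHLY_DECAY, 12):.1%}")
```

Both readings land between roughly 22% and 25%, which is consistent with about a quarter of records going stale within a year.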

Conclusion

Dark data can be both a blessing in disguise and an operational burden. It can conceal meaningful, yet invisible datasets, which can drive your business further in the market. At the same time, such high-volume data can increase storage costs, slow AI adoption, and pose security and compliance risks.

To find and manage dark data, you need to get an overview of all the data your organization owns, evaluate it, organize it, and separate the wheat from the chaff. The next step is to develop a comprehensive data governance framework, introduce data management policies, and build a strong data culture across the departments.

Today, data, its quality, and analytics determine a business’s success. As organizations collect and store more of it during regular business activities, the issue of unrecognized, forgotten data becomes sharper. If you face hurdles with hidden data, SoftTeco’s data scientists and engineers can help you unravel complexities. Contact us and get a free consultation.

FAQ

Is dark data big data?

Dark data is not the same as big data, but it can be part of it. Big data refers to large and complex collections of data that cannot be processed using traditional tools. We can also view it as a set of technologies, practices, and solutions related to solving business problems. When analyzed, these large amounts of data provide insights that lead to real commercial opportunities. Dark data, in turn, is hidden and hard to analyze, even though it can be of value to a company and can be part of a big data set.

What are examples of dark data?

The examples of dark data can differ depending on the company and industry. Usually, it is outdated, unutilized, and unstructured. The organization may not even know the data is being collected. For example, dark data can include:

System log files and survey data;
Former employee files;
Financial statements, geolocation details;
Video surveillance data and call transcripts;
Notes, presentations, or old documents;
Archived emails and email attachments;
Old CRM records and social media archives;
Inactive databases, etc.

Is all dark data worth analyzing?

No, not all dark data is worth analyzing. While some dark data has high potential value, other datasets should simply be deleted. The most balanced approach is selective analysis. To do so, you need to:

Decide what business problem you want to solve
Assess dark data potential value
Check data sensitivity and compliance
Estimate data quality and analytical feasibility
Prioritize your datasets by impact and effort

What’s the first step to tackling dark data?

The first step to tackling dark data is discovery. To make hidden datasets visible, you need to learn why they are collected, find out who their owner is, and understand where they are stored. Create a comprehensive catalog of all the data your organization has, looking for dark data in local storage and cloud, business systems, logs, archives, and file shares.

Can small businesses have dark data problems too?

Yes, small businesses can have dark data problems too, although to a lesser extent than enterprises. Like large companies, SMBs have CRM, POS, accounting software, and collaboration tools. They often collect and store customer records, inventory logs, and audit trails that they never analyze or even know about.
