What is Data Harvesting?

July 08, 2025

Cybersecurity might as well have its own language. Cybersecurity professionals and threat actors alike use so many acronyms, terms, and sayings that anyone without deep knowledge of, experience in, or a keen interest in the security field may not know them. Understanding what these acronyms and terms mean is the first step to developing a thorough understanding of cybersecurity and, in turn, better protecting yourself, clients, and employees.

In this blog series, we aim to explain and simplify some of the most commonly used terms. Previously, we have covered bulletproof hosting, CVEs, APIs, brute force attacks, zero-day exploits, and doxing. In this edition, we dive into data harvesting.

Data Harvesting 101

Data harvesting refers to the automated collection of data from digital sources, such as websites, apps, APIs, databases, or public records, with the goal of drawing inferences. It’s often accomplished using tools like web scrapers, crawlers, or specialized software. There are legitimate reasons for data harvesting as well as nefarious purposes. We will dive into both.

The What and How

Data harvested without consent is typically sourced from data breaches, phishing scams, or malware. It can include personal information, login credentials, credit card numbers, location data, social data (such as likes, posts, and connections), behavioral data (such as browsing history and habits), and medical records.

Data harvesting is carried out through various methods, each with different levels of transparency and legality. Among the most common tools are cookies and trackers, which are embedded in websites to monitor user behavior, such as browsing patterns, clicks, and time spent on pages. APIs and scrapers are also widely used to systematically extract data from online platforms, often automating the collection of vast amounts of information in a short time. Apps and connected devices can harvest data through user-granted permissions, or sometimes through hidden scripts, gathering information like contacts, location, and device usage. More maliciously, phishing campaigns and malware can deceive users into giving up sensitive information or infect their systems to extract data covertly, posing significant security and privacy risks.
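
To make the scraper idea concrete, here is a minimal sketch in Python that pulls every link out of a page using only the standard library. The sample page, the LinkHarvester class, and the harvest_links function are made up for illustration. A real scraper would also fetch pages over the network, respect a site's terms of service, and handle far messier HTML.

```python
from html.parser import HTMLParser


class LinkHarvester(HTMLParser):
    """Collects href values from <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def harvest_links(html_text):
    """Return all link targets found in the given HTML text."""
    parser = LinkHarvester()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Made-up sample page (assumption); a real scraper would download this.
SAMPLE_PAGE = """
<html><body>
  <a href="https://example.com/products">Products</a>
  <a href="https://example.com/contact">Contact</a>
  <a name="anchor-only">No link here</a>
</body></html>
"""

links = harvest_links(SAMPLE_PAGE)
print(links)
```

The same pattern scales from one page to millions, which is why scraping can collect so much information so quickly.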

Legitimate Reasons for Data Harvesting

Marketing and Advertising: Businesses use it to understand consumer behavior, market trends, competitor pricing, and product performance. Companies use this harvested data to build detailed consumer profiles and deliver targeted ads. By understanding your interests, habits, and demographics, advertisers can increase the chances of clicks and sales.
Lead Generation: Collecting contact information for sales and marketing outreach.
Research: Academics and researchers use it to gather data for studies in various fields, such as social science, economics, and healthcare. AI training is another emerging use: large datasets are fed into AI models for training. This includes data scraped from the web (like text, images, or behavior patterns) to build chatbots, recommendation engines, and facial recognition systems.
Content Aggregation: Collecting content from multiple sources to create news aggregators or comparison websites.
Improving User Experience: Understanding user preferences and behavior to enhance websites and applications. Organizations analyze the data to uncover trends, improve services, forecast demand, or enhance customer experience. For example, a retailer might use browsing and purchase data to optimize inventory or personalize recommendations.
Data Brokerage: Data brokers collect and aggregate data from many sources, then sell it to third parties—like marketers, insurers, employers, or political campaigns.

Nefarious Reasons for Data Harvesting

Identity Theft and Fraud: Harvesting personal information (names, addresses, email, payment details) to commit identity theft or fraudulent activities.
Spam: Collecting email addresses for mass unsolicited emails.
Intellectual Property Theft: Scraping proprietary content, product designs, or strategic plans from competitors.
Data Breaches: If harvested data is not adequately secured, it can be vulnerable to breaches, exposing sensitive information.

Harvested data is often sold on darknet marketplaces. Once “harvesters” have collected the data, they often dump it on the darknet and offer it for sale across different marketplaces, usually for financial gain. Collected data can also be used for blackmail, doxing, or stalking, and political extremists or activist groups may use it for targeted attacks and campaigns.

To the left we see an example of a combolist (a list of email address and password combinations that may be used in brute force attempts or credential stuffing operations to gain unauthorized access to servers and services) that was leaked and posted on a darknet site. Databases from data harvesting will often include usernames and passwords, fullz (full identity profiles), financial records, or health records. These are often highly confidential or sensitive and can cause considerable harm when posted without consent.
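
From a defender's point of view, a combolist is just structured text that can be checked for your own organization's accounts. The sketch below assumes the common "email:password" layout, one entry per line. That layout is an assumption, as real lists vary in delimiter and quality. The sample entries and function names are invented for illustration.

```python
from collections import Counter


def parse_combolist(lines):
    """Yield (email, password) pairs from 'email:password' lines."""
    for line in lines:
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed entries
        # Split on the first colon only, since passwords may contain ':'
        email, password = line.split(":", 1)
        if "@" in email:
            yield email.lower(), password


def exposed_accounts(lines, domain):
    """Return the sorted, de-duplicated emails that belong to one domain."""
    suffix = "@" + domain.lower()
    return sorted(
        {email for email, _ in parse_combolist(lines) if email.endswith(suffix)}
    )


# Invented sample entries (assumption: colon-delimited format)
SAMPLE = [
    "alice@example.com:hunter2",
    "bob@example.org:pa:ss",
    "Carol@Example.com:letmein",
    "not a valid line",
    "alice@example.com:another-password",
]

exposed = exposed_accounts(SAMPLE, "example.com")
domain_counts = Counter(
    email.split("@", 1)[1] for email, _ in parse_combolist(SAMPLE)
)
print(exposed)
print(domain_counts)
```

A security team could run a check like this against a leaked list to decide which employees need a forced password reset.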

Anonymity Encourages Abuse

The darknet is a layer of the internet that was designed specifically for anonymity. It is more difficult to access than the surface web and is accessible only via special tools and software, specifically browsers and other protocols. You cannot access the darknet by simply typing a dark web address into your web browser. There are also darknet-adjacent networks, such as instant messaging platforms like Telegram, the deep web, and some high-risk surface websites. Because of the anonymous nature of the darknet, data harvesters are able to go undetected, monetize data without revealing their identity, and collaborate with others on the darknet.

Doxing Example

The darknet site Doxbin facilitates doxing by allowing users to upload text-based content related to individuals. The site claims to restrict content that is spam, child sexual abuse material (CSAM), or in violation of the hosting country’s jurisdictional laws. However, in practice, there is minimal moderation, and information is often shared with the intent to target individuals.

The exposure of PII on Doxbin can lead to severe consequences for victims, including harassment, identity theft, and threats to personal safety. The harassment may take the form of prank calls, spam emails, and cyberbullying on social media.

DarkOwl’s Use of Data Harvesting

DarkOwl’s data harvesting involves collecting information from the darknet, deep web, and high-risk surface web to provide intelligence to its customers. This data is used to identify threat actors, monitor cyber breaches, analyze darknet trends, and more. DarkOwl’s data collection process combines automated, AI-powered collection with manual analysis, with the goal of delivering high-quality, relevant, and timely intelligence.

What DarkOwl Collects

Darknet Data: The darknet is a layer of the Internet that cannot be accessed by traditional browsers and often requires specialized technology (proxies), as well as a certain level of technical sophistication, to access. While there are several darknets, Tor (or The Onion Router) is by far the most common. In addition to Tor, DarkOwl also scrapes content from peer-to-peer networks like I2P and ZeroNet.
Deep Web Data: The deep web is technically part of the surface web and can be best described as any content on the surface web that is not indexed or searchable via traditional search engines. This includes surface web paste sites and websites that we discovered via authenticated means, e.g. surface-level websites that require user registration and/or a login to access meaningful information from the site. DarkOwl collects from hundreds of ‘deep web’ sites, including marketplaces and forums, from which a mixture of authenticated and manual crawlers obtain information.
High-Risk Surface Web: Surface web content consists of anything on the “regular” internet that is public facing with a surface web top-level domain (TLD) and could be organically crawled/scraped by Google. This includes the landing pages and/or preview content for forums that DarkOwl also has curated deep web access to (i.e., registrations and authentication).
Chat Platforms: Chat platforms are any website (be it on the deep web or darknet), app, or service whose primary purpose is instant messaging. This includes message exchanges between individual users or groups of users who interact in topic-based channels and groups. Some chats are collected from Tor services that are enabled with real-time anonymous chat features, others from specialized instant messaging or proprietary protocols like IRC and Telegram.
Breach Content: Data breaches are aggregate data files of information obtained without the owners’ consent. This can consist of commercial data leaked by threat actors (TAs) either after discovery of a non-secured database or misconfigured server, or through a targeted malicious cybersecurity incident (direct breach). Such leaks include internal sensitive email records, usernames and passwords, personally identifiable information (PII), financial records, and more. Data breaches are often sold for profit on the darknet, although they are sometimes posted and leveraged by criminal actors for means other than financial gain or in the fallout of cyber warfare between nation-state sponsored cyber powers and hacktivists.
Other Sources: DarkOwl also has limited documents in its Vision database collected from misconfigured FTP and alternative DNS servers, as well as open public S3 buckets. Collection from these sources is less real-time and intentional than collection from the other data sources described above.
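
As a rough illustration of how collected content might be tagged by network, the sketch below labels a URL based on its hostname suffix, using the ".onion" suffix for Tor services and ".i2p" for I2P sites. This is a simplified, hypothetical helper, not a description of DarkOwl's pipeline. Note that a hostname alone cannot separate surface web from deep web content, since that distinction depends on indexing and authentication rather than the address.

```python
from urllib.parse import urlsplit


def classify_network(url):
    """Label a URL by network, based only on its hostname suffix."""
    host = (urlsplit(url).hostname or "").lower()
    if host.endswith(".onion"):
        return "tor"
    if host.endswith(".i2p"):
        return "i2p"
    if host:
        # Surface vs. deep web cannot be decided from the hostname alone
        return "surface-or-deep-web"
    return "unknown"


# Made-up example addresses (assumption), one per category
for url in [
    "http://exampleaddress.onion/forum",
    "http://example.i2p/",
    "https://example.com/login",
    "not a url",
]:
    print(url, "->", classify_network(url))
```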

How DarkOwl Collects Data

Automated AI: Automated tools and AI-powered engines collect and process data in near real-time.
Manual Analysis: Human analysts augment automated collection, ensuring the quality and relevance of the data.

How DarkOwl Processes and Structures Data

Unstructured Data: DarkOwl collects data in its original, raw-text format.
Data Cleaning and Storage: Collected data is processed, cleaned, and stored in a secure environment.
Entity Extraction: DarkOwl identifies and extracts entities like email addresses, Social Security numbers, and cryptocurrencies.
Metadata and Context: Included metadata and source content provide context and allow users to quickly identify important data.
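
Entity extraction of the kind listed above can be approximated with pattern matching over raw text. The following is a minimal sketch using regular expressions. The patterns are deliberately simple illustrations (they are not DarkOwl's actual rules and will produce false positives and misses), and the sample post and placeholder Bitcoin-style address are invented.

```python
import re

# Simplified, illustrative patterns (assumptions, not production rules)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# Rough shape of a Bitcoin address: legacy Base58 or bech32 "bc1" form
BTC_RE = re.compile(r"\b(?:bc1[a-z0-9]{25,59}|[13][a-km-zA-HJ-NP-Z1-9]{25,34})\b")


def extract_entities(text):
    """Return the entities found in raw text, grouped by type."""
    return {
        "email": EMAIL_RE.findall(text),
        "ssn": SSN_RE.findall(text),
        "btc_address": BTC_RE.findall(text),
    }


# Placeholder address with the right shape only; it is not a real wallet
FAKE_BTC = "1" + "A" * 30

SAMPLE_POST = (
    "Selling fullz. Contact vendor@example.com, SSN 123-45-6789 included. "
    "Payment to " + FAKE_BTC + " only."
)

entities = extract_entities(SAMPLE_POST)
print(entities)
```

Storing the extracted entities alongside the original raw text is what allows an analyst to search for, say, a single email address and jump straight to every post that mentions it.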

Why DarkOwl’s Data is Valuable

Threat Intelligence: DarkOwl’s data can help organizations identify and understand emerging threats, including cyber breaches, ransomware attacks, and fraud.
OSINT Investigations: Darknet data is a vital part of OSINT (open-source intelligence) investigations to gather insights into specific individuals or groups, including their usernames, aliases, and online activity.
Digital Risk Assessment: DarkOwl’s data can help organizations assess their digital risk posture and identify vulnerabilities by seeing what information concerning them is available on the darknet.