From DarkOwl’s CTO: Deciphering Darknet Big Data

Ramesh Elaiyavalli has joined DarkOwl as its Chief Technology Officer, bringing a wealth of data science expertise and a zest for solving complex technical problems. We spoke to Ramesh to give our readers an opportunity to hear his unique thoughts and present a fresh perspective about the critical intersection between the darknet and big data.

One thing I’ve learned since joining DarkOwl is that the darknet, the deep web, and everything else that makes up the underground criminal ecosystem are constantly evolving in size, shape, and color. Having deployed automated crawlers in the darknet since 2015, the team at DarkOwl knows firsthand the challenges of maintaining in-depth knowledge of this ever-changing digital data landscape.

I’ve also noticed that some darknet-centric companies operate with a focused mission of threat intelligence and security awareness, providing custom, highly tailored intelligence products to answer their customers’ cybersecurity questions. At DarkOwl we take a more agnostic view: we focus on maintaining the largest set of commercially available darknet data, with prudent consideration for the various “V’s” of Big Data, and we apply that model to all data discovered across many different anonymous networks and deep web criminal communities.

While we have the in-house expertise to dig deep into the diverse anonymous data sources at our disposal, our products are designed to drive high-value business decisions through fast, frequent collection of accurate and disparate data from a wide array of distributed data sources.

Big Data Forces Ingenious Architectures

The NIST Big Data Interoperability Framework defines “Big Data” as the large amount of data in the networked, digitized, sensor-laden, information-driven world. The authors of that framework describe “Big Data” and “data science” as essentially buzzwords: composites of many other concepts across computational mathematics and network science.

Data can appear in “structured” and “unstructured” formats. According to IBM, not all data is created equal. Structured data is often quantitative, highly organized, and easily decipherable, while unstructured data is more often qualitative, and not easily processed and analyzed with conventional tools.

In the last decade the amount of unstructured data available to an individual has skyrocketed. Think about the amount of raw data a person consumes or generates on any given day through mediums like SMS text messaging, watching and/or creating YouTube videos, editing and sharing digital photographs, interacting with dynamic web pages, and keeping up with the demands of social media.

The darknet and deep web are a vast source of structured, semi-structured, and unstructured data. Collecting, processing, analyzing, and distributing meaningful and targeted datasets from that source to clients and users across diverse industry verticals, such as FinTech, InsureTech, Identity Protection, and Threat Intelligence providers, forces an ingenious data architecture. At DarkOwl we employ a modified version of the “Big Data” model often depicted by the “V’s” of Big Data.

Volume – DarkOwl endeavors to deliver petabytes of data processed in real time with crawlers operating across different anonymous networks, deep websites, and platforms. As of this week, our Vision system has collected and indexed over 278 million documents of darknet data across Tor, I2P, and Zeronet in the last year. Our entities system has uncovered and archived over 8 billion email addresses, 13 billion credit card numbers, 1.6 billion IP addresses, and over 261 million cryptocurrency addresses.

Velocity – DarkOwl’s resources are designed to provide fast and frequent data updates, such as collecting from real-time instant messaging sources and capturing live discussions between users on darknet forums. In the last 24 hours, our system crawled and indexed over 2.5 million new documents of data.

Veracity – DarkOwl collects the most accurate data available from legitimate and authentic sources discovered in the darknet, deep web, and high-risk surface web. DarkOwl scrapes darknet data without translation in its native language to avoid contextual loss from automated in-platform translation services.

Variety – The data DarkOwl discovers is disparate, drawn from diverse and distributed data sources such as Tor, I2P, Zeronet, FTP, and publicly available chat platforms with instant or near real-time messaging. We collect everything from darknet marketplace listings for drugs and malware to user contributions to forums and Telegram channel messages.

Value – DarkOwl delivers its data through a variety of delivery mechanisms, along with our expert insights, to help drive high-value business decisions for our clients and stakeholders. Raw darknet data helps provide valuable evidence for everything from qualitative investigations to quantitative risk calculations.

Voices – We added another “V” to the model to include the voices of the various personas and threat actors conducting criminal operations in the underground. Our Vision Lexicon helps users easily decipher and filter by marketplace, vendors, forums, threat actor pseudonyms, and ransomware-as-a-service (RaaS) operators.
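To make the entity counts under Volume a little more concrete, here is a minimal sketch of how email addresses, IPv4 addresses, and payment card numbers might be pulled out of a raw document. The regular expressions, the `extract_entities` function, and the use of a Luhn checksum to discard random digit runs are illustrative assumptions, not a description of DarkOwl's entities system. Cryptocurrency addresses are left out for brevity.

```python
import re

# Illustrative patterns only; production extractors are far more thorough.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def valid_ipv4(candidate: str) -> bool:
    """Reject dotted quads with any octet above 255."""
    return all(0 <= int(part) <= 255 for part in candidate.split("."))


def extract_entities(text: str) -> dict[str, set[str]]:
    """Collect de-duplicated entities found in one unstructured document."""
    cards = set()
    for match in CARD_RE.findall(text):
        digits = re.sub(r"[ -]", "", match)
        if luhn_ok(digits):
            cards.add(digits)
    return {
        "email": set(EMAIL_RE.findall(text)),
        "ipv4": {m for m in IPV4_RE.findall(text) if valid_ipv4(m)},
        "card": cards,
    }
```

Validation steps such as the checksum matter at this scale: without them, an index of billions of entities would fill up with digit strings that only look like card numbers.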

Multi-Dimensional Darknet Data Collection Strategies

Before we can jump into the technological architectures available to deliver scalable Big Data, we should discuss the multi-dimensional facets of data collection from dark networks. There exists an unspoken spectrum of darknet data collection. On one end of the spectrum, there is a collection strategy focused on directing a small number of assets to facilitate incredibly deep and near-constant coverage of a relatively tiny segment of what is presently an unquantifiable data space. Defining this segment outside of publicly known, well-established sources of malicious activity without buying illegal data or compromising our integrity is tricky.

On the other end of the spectrum is a collection strategy focused on sending out a much larger number of assets to facilitate broader collection across many different sources, capturing and characterizing as much of this unquantified data space as possible. At DarkOwl we show preference for this end of the spectrum as it increases the variety and veracity of our Big Data model. We also dedicate collection resources to a smaller, select number of darknet services that require authentication, require solving a captcha or puzzle, or are accessible by invitation only. We attempt to augment our broad-spectrum strategy by collecting from these sources at a greater depth and higher frequency than other sites.

I think it’s also important to add a third dimension here: time. Data collected from a given source once, without revisits or frequent updates, is of considerably less value than data collected at a regular operational tempo. DarkOwl also maintains a strict retention policy for darknet documents, many of which come from sources that are now offline or no longer available, in support of historical analysis and trend development over time. Many of these documents help characterize and track the evolution of threat actors’ voices for law enforcement investigations; others feed risk calculations, such as the date on which compromised corporate credentials or other company exposure first appeared on the deep web.
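One way to picture why retained history matters for risk calculations is a small index that remembers the earliest date an entity was observed, even if the source later disappears. The sketch below is an assumption-laden illustration; `FirstSeenIndex`, its methods, and the sample sources are hypothetical names, not part of any DarkOwl product.

```python
from datetime import date
from typing import Optional


class FirstSeenIndex:
    """Keeps the earliest observation date and every source for each entity."""

    def __init__(self) -> None:
        self._first_seen: dict[str, date] = {}
        self._sources: dict[str, set[str]] = {}

    def observe(self, entity: str, seen_on: date, source: str) -> None:
        """Record one sighting; observations may arrive out of date order."""
        current = self._first_seen.get(entity)
        if current is None or seen_on < current:
            self._first_seen[entity] = seen_on
        self._sources.setdefault(entity, set()).add(source)

    def first_seen(self, entity: str) -> Optional[date]:
        return self._first_seen.get(entity)

    def sources(self, entity: str) -> set[str]:
        return set(self._sources.get(entity, set()))

    def exposure_days(self, entity: str, as_of: date) -> Optional[int]:
        """Days between the first sighting and the given date, if known."""
        first = self._first_seen.get(entity)
        return None if first is None else (as_of - first).days
```

A credential observed on a forum in 2021 and on a long-gone paste site in 2020 would report the 2020 date, which is only possible if the older document was retained.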

Our data collection strategy endeavors to balance these three dimensions (breadth, depth, and time) to ultimately maximize the “Vs” of Big Data, with an emphasis on contributing to our clients’ bottom line.
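The balance between breadth, depth, and time can be sketched as a revisit scheduler in which every source has its own interval: many broad sources on a slow cadence and a few gated sources on a fast one. This is a simplified illustration under stated assumptions; `RevisitScheduler`, the hour-based time units, and the source names are hypothetical, and a real crawler fleet would also handle failures, rate limits, and authentication.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class ScheduledCrawl:
    due_at: float                          # only field used for heap ordering
    source: str = field(compare=False)
    interval: float = field(compare=False)


class RevisitScheduler:
    """Min-heap of sources ordered by the time of their next visit."""

    def __init__(self) -> None:
        self._heap: list[ScheduledCrawl] = []

    def add_source(self, source: str, interval: float, start_at: float = 0.0) -> None:
        if interval <= 0:
            raise ValueError("interval must be positive")
        heapq.heappush(self._heap, ScheduledCrawl(start_at, source, interval))

    def pop_due(self, now: float) -> list[str]:
        """Return every source due at `now` and schedule its next visit."""
        due = []
        while self._heap and self._heap[0].due_at <= now:
            item = heapq.heappop(self._heap)
            due.append(item.source)
            # Missed visits are not replayed; the next one is measured from now.
            heapq.heappush(
                self._heap, ScheduledCrawl(now + item.interval, item.source, item.interval)
            )
        return due
```

With a 24-hour interval for a broad source and a 1-hour interval for a gated forum, the forum is revisited many times between each visit to the broad source, which is the depth-and-frequency augmentation described above.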

Big Data Delivery Mechanisms

Data warehouse – A data warehouse consists of mostly structured data. Think of it as a giant database that you can access via SQL. Here you can store names, SSNs, phone numbers, email addresses and so on – with very large volumes. Data warehouses are traditionally based on RDBMS technologies such as Oracle, DB2, Postgres etc., and they take a ton of resources to build and maintain, hence the drop in popularity over time. We do not have a data warehouse at DarkOwl.
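As a generic illustration of the "giant database that you can access via SQL" idea, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a full RDBMS. The table name, columns, and sample rows are invented for the example and do not reflect any DarkOwl system.

```python
import sqlite3


def exposure_counts_by_domain(rows: list[tuple[str, str, str]]) -> list[tuple[str, int]]:
    """Load structured rows into a table and aggregate them with SQL."""
    conn = sqlite3.connect(":memory:")  # in-memory database for the example
    try:
        conn.execute(
            "CREATE TABLE exposed_emails ("
            " email TEXT NOT NULL,"
            " domain TEXT NOT NULL,"
            " first_seen TEXT NOT NULL)"
        )
        conn.executemany("INSERT INTO exposed_emails VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT domain, COUNT(*) AS n FROM exposed_emails"
            " GROUP BY domain ORDER BY n DESC, domain"
        )
        return cursor.fetchall()
    finally:
        conn.close()


sample_rows = [
    ("alice@corp.example", "corp.example", "2021-01-05"),
    ("bob@corp.example", "corp.example", "2021-02-11"),
    ("carol@shop.example", "shop.example", "2021-03-20"),
]
counts = exposure_counts_by_domain(sample_rows)
```

The fixed schema is what makes the aggregate query trivial, and it is also what makes warehouses costly to build and maintain when the incoming data is mostly unstructured.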

Data lake – A data lake consists of a combination of structured AND unstructured data. Most of it is unstructured: medical transcriptions, court documents, audio, video, screenshots, and so on. The structured data mostly serves to tag and link the unstructured data. Data lakes are more popular now because they are easy to create, and they are supported by cloud-native vendors such as Amazon AWS, Google Cloud, Microsoft Azure, etc. At DarkOwl, we populate many of our customers’ data lakes. We can also stand up a custom data lake that contains a subset of our data and give customers access to it.

Data feeds – Data feeding describes the process of pushing parts of our Big Data over to the customer side. For example, we feed only credentials to some customers, or only credit cards to others, and in some cases we provide a daily snapshot of everything we have visibility of directly to the customer for their own business use case. Feeds are technically accomplished by setting up a receiver on the customer side, usually a secure Amazon S3 bucket. We can also set up feeds into Azure or Google storage. Keep in mind that feeds always run from the current point in time forward. If customers need data from the past, we will charge separately for a one-time dump, also called “data hydration” or “seeding.”
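The two rules described for feeds, filtering by entity type and delivering only from the subscription start forward, can be sketched as follows. `Record`, `Subscription`, `select_for_feed`, and `hydration_batch` are hypothetical names introduced for illustration. The upload step to a customer bucket is deliberately left out, since it depends on the cloud provider and credentials.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Record:
    entity_type: str        # e.g. "credential" or "credit_card" (assumed labels)
    value: str
    collected_at: datetime


@dataclass(frozen=True)
class Subscription:
    customer: str
    entity_types: frozenset[str]
    started_at: datetime


def select_for_feed(records: list[Record], sub: Subscription) -> list[Record]:
    """Records the feed delivers: subscribed types, collected from the start onward."""
    return [
        r for r in records
        if r.entity_type in sub.entity_types and r.collected_at >= sub.started_at
    ]


def hydration_batch(records: list[Record], sub: Subscription) -> list[Record]:
    """One-time historical dump: subscribed types collected before the start."""
    return [
        r for r in records
        if r.entity_type in sub.entity_types and r.collected_at < sub.started_at
    ]
```

Keeping the forward feed and the historical batch as separate selections mirrors the separate one-time "hydration" or "seeding" delivery.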

Data streaming – To process data coming at us rapidly, we use open-source, industry-standard technologies such as Kafka at DarkOwl. Such services are mostly for internal use, but we could easily set up a customer as one of the subscribers to our data stream. This especially makes sense when the velocity of data is very high, which is often the case for darknet data. For an analogy, take Tesla. Their car is a moving big data machine: every turn and every camera emits massive amounts of data that cannot be pushed fast enough to a customer’s data lake via a data feed. In these high-frequency data situations, we will allow customers to consume directly from our Kafka stream. We will obviously only explore this option if we trust the customer and they pay us lots of money.
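The subscriber model can be illustrated without a running broker. The sketch below is an in-memory stand-in for a single topic, not Kafka itself and not a Kafka client: it only mimics the core idea of an append-only log in which each subscriber tracks its own read offset, so an internal consumer and a customer can read the same stream independently. The `Topic` class and its method names are hypothetical.

```python
class Topic:
    """Append-only message log with an independent read offset per subscriber."""

    def __init__(self) -> None:
        self._log: list[str] = []
        self._offsets: dict[str, int] = {}

    def publish(self, message: str) -> None:
        self._log.append(message)

    def subscribe(self, subscriber: str, from_beginning: bool = False) -> None:
        """New subscribers start at the end of the log unless asked otherwise."""
        self._offsets[subscriber] = 0 if from_beginning else len(self._log)

    def poll(self, subscriber: str, max_messages: int = 100) -> list[str]:
        """Return the next unread messages and advance this subscriber's offset."""
        start = self._offsets[subscriber]
        batch = self._log[start:start + max_messages]
        self._offsets[subscriber] = start + len(batch)
        return batch


topic = Topic()
topic.publish("doc-1")
topic.subscribe("internal-indexer", from_beginning=True)
topic.subscribe("customer-a")            # joins late, sees only new messages
topic.publish("doc-2")
topic.publish("doc-3")
```

Because each subscriber's offset is separate, adding a customer as a subscriber does not disturb internal consumers, which is what makes direct stream access practical when velocity is high.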

At DarkOwl, we have a variety of customized solutions we can deploy quickly to satiate the needs of all our customers.

Final Thoughts

As you can see, the data science challenges of collecting, organizing, and delivering continuous relevant darknet Big Data are intellectually fascinating and absolutely exhilarating to undertake.

I look forward to augmenting and refining DarkOwl’s Big Data product line by implementing new technical solutions and expanding into novel, cutting-edge anonymous sources. Reach out to us directly; I would welcome a conversation about how your company or organization could benefit from Darknet Big Data from DarkOwl.