Uncovering Valuable Insights: A Guide to Data Hunting
Chapter 1: The Evolution of Data Insights
In the late 2000s, the spotlight turned to data as organizations struggled to derive meaningful insights from the vast amounts of information haphazardly gathered from customers, products, and services. The groundwork laid by computer science, statistics, and mathematics was essential both for addressing data overload and for reducing long-term costs. Automated systems, platforms, and decision-making processes became the preferred way to tackle data saturation and surface critical insights, a task akin to finding needles in a haystack.
As we navigate the 2020s, the challenge of data overload has only intensified, compounded by an even larger volume of data, expensive cloud storage, a plethora of tools spanning everything from collection to visualization, and a continuous stream of new data-centric research. The automation once promised as the solution is no longer reliable: consistent, valuable insights remain elusive, and statistical errors, biased datasets, and other algorithmic failures regularly make headlines. Essentially, the data industry finds itself back at square one when it comes to effective and reliable data methodologies. Meanwhile, the field is expanding beyond its traditional realms into sociology, political science, economics, and data ethics, with particular emphasis on the impact on historically marginalized communities.
The urgency to identify high-quality data quickly has reached new heights. With algorithm-based methods under scrutiny, success hinges on two traits in data professionals: relentless curiosity and adaptability. Both demand a commitment to extensive reading, encompassing everything from reference materials to opinion pieces. Personally, I engage with 2-3 opinion articles most days and skim through several research papers, legislative proposals, or public reports each week. According to the Artificial Intelligence Index Report 2021, the volume of AI-related publications on arXiv grew from 5,478 in 2015 to 34,736 in 2020, an increase of more than sixfold that works out to roughly 95 papers daily, many of which lack peer review or rigorous accuracy checks. While I don't suggest relying solely on arXiv, it remains a popular source for many in the field.
Over time, I have developed a systematic approach to evaluating data-centric articles. My method begins with assessing the quality of the data, followed by scrutinizing its interpretation. I categorize data using my own scale, which indicates the quality of the information presented in the article. This scale informs my level of engagement with the data model, tool, or platform being discussed. This foundational work falls under the umbrella of data ethics, which prioritizes human welfare over machine efficiency.
Let me explain my data scale, which comprises five categories: raw, half-baked, good, rotting, and stale. "Raw" data is unprocessed and generally unusable. "Half-baked" data has undergone minimal algorithmic processing but raises more questions than it resolves. "Good" data is adequately refined through appropriate algorithms, yielding valuable results. "Rotting" data refers to poorly repurposed datasets, where information collected for one purpose is misapplied to another, leading to inconsistencies. Lastly, "stale" data signifies previously valuable information that has become outdated, often seen in statistics that aren't regularly updated.
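If it helps to see the scale written down, here is a minimal sketch of it as a Python enum; the names and one-line descriptions simply restate the categories above and are purely illustrative, not part of any formal tool I use.

```python
from enum import Enum

class DataQuality(Enum):
    """My informal five-point scale for data encountered in articles."""
    RAW = "unprocessed and generally unusable"
    HALF_BAKED = "minimally processed; raises more questions than it resolves"
    GOOD = "adequately refined; yields valuable results"
    ROTTING = "collected for one purpose, misapplied to another"
    STALE = "once valuable, now outdated"
```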
After a preliminary assessment of the data, I classify it within one of these five categories. If the data is classified as rotting or stale, I abandon the analysis altogether. I've learned to recognize these situations when I find myself either bored by the lack of relevance or frustrated by unanswered questions. In contrast, with raw, half-baked, and good data, the value may not be immediately apparent until I further contextualize and analyze the data model, tool, or platform in question. To do this, I apply four principles of data ethics and search for answers to specific questions within the articles, as outlined in the list and the sketch that follows it.
- Intention: How do the authors articulate the goals and benefits of their work?
- Transparency: Are the data operations clear and comprehensible? How innovative is the proposed work?
- Ownership: Who is responsible for the data fed into and produced by the system? Is the data holder also its owner?
- Outcomes: How does the proposed work impact historically excluded communities? Are the insights valuable for vulnerable groups?
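For those who like checklists, here is a rough, hypothetical sketch of how these notes could be kept while reading; the field names are merely my shorthand for the four principles and the helper method is illustrative.

```python
from dataclasses import dataclass

@dataclass
class EthicsNotes:
    """Notes on one article, organized by the four principles above."""
    intention: str = ""     # stated goals and benefits of the work
    transparency: str = ""  # clarity and novelty of the data operations
    ownership: str = ""     # who holds, and who owns, the input and output data
    outcomes: str = ""      # impact on historically excluded communities

    def unanswered(self) -> list[str]:
        """Return the principles I still have no notes on after reading."""
        return [name for name, note in vars(self).items() if not note.strip()]
```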
Reading op-eds takes only a few minutes, while the time spent on reference books and detailed documents varies based on necessity. The most challenging to digest are reports from national and international organizations, as well as academic conference papers.
For those who engage with scientific literature in AI, data, or computing, conference papers typically follow this structure: abstract, introduction, literature review, methodology, experiments and discussion, and conclusion. Intention is generally stated in the abstract, introduction, and conclusion, while transparency appears in the methodology section. I also assess the novelty of the proposed method while skimming the methodology and related literature. Ownership details are most prominent in the experiments and discussion section, and outcomes are typically found in the discussion as well.
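Written out as a reading aid, that mapping looks roughly like the sketch below; it reflects where the answers usually live in my experience, not a fixed rule of paper structure.

```python
# Where each principle usually surfaces in an AI/data conference paper.
SECTION_HINTS = {
    "intention":    ["abstract", "introduction", "conclusion"],
    "transparency": ["methodology", "literature review"],
    "ownership":    ["experiments and discussion"],
    "outcomes":     ["discussion"],
}
```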
Admittedly, assessing transparency requires significant time investment, so I limit this to 20 minutes of focused skimming. The other sections take roughly 20 minutes combined. I often revisit articles that catch my attention on social media, spending about 45 minutes on them. Responses regarding intention and outcomes can vary depending on the publication's editorial stance, while ownership issues often have default answers due to the industry's reliance on benchmark datasets. However, the growing discourse surrounding ownership suggests these queries will gain prominence sooner than anticipated. A consistent benchmark for identifying quality data is my ability to address transparency questions within 20 minutes. If it's straightforward, I label it as good data; if I have partial answers, it's half-baked; otherwise, it's raw.
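That benchmark reduces to a simple rule of thumb, sketched below; the answered and partial flags stand in for my own judgment after the 20-minute skim, so treat this as an illustration rather than a precise procedure.

```python
def label_after_skim(answered: bool, partial: bool) -> str:
    """Classify an article by how its transparency questions held up
    after roughly 20 minutes of focused skimming."""
    if answered:
        return "good"        # transparency questions answered outright
    if partial:
        return "half-baked"  # only partial answers within the time box
    return "raw"             # little or nothing answerable yet
```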
So there you have it. Happy data hunting!
Chapter 2: Recommended Viewing for Data Enthusiasts
To further enhance your understanding of data insights, consider the following video resources:
"Building Better Hunt Data" explores effective methodologies for extracting meaningful insights from complex datasets, delving into strategies that enhance data utilization.
"Threat Hunting with Data Science, Machine Learning, and Artificial Intelligence" discusses the role of advanced technologies in identifying threats and optimizing data analysis.