Predictum Blog

Nov 03, 2017

Behind the Scenes of High-Volume Data Extraction

This is the second of three posts in a blog series associated with Predictum’s webinar, Reactivating Your Dormant Data to Extend its Business Value, on November 16, 2017.

In this post, Farhan Mansoor gives a glimpse into how our Data Extraction Service works. Highlighted features and functions of the Data Extraction Service are presented in our on-demand webinar, which will be available shortly.

Q: What is the Data Extraction Service and what business problem does it help to address?

FM: The Data Extraction Service is designed to evaluate dormant, unstructured data and extract meaningful elements to be structured for effective reuse. Once the meaningful data is structured, it’s maintained in a database so that it can be reused and shared more widely across an organization.

The Service evaluates high volumes of unstructured data and extracts only the elements that have potential value according to a company’s business requirements. It evaluates data against specified formats, recognizes it as meaningful by its surrounding context, and uses keyword patterns for structuring and standardization. The Service can be customized to the particular needs of individual companies and for various industries.
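To make the idea concrete, here is a minimal sketch of format, context, and keyword matching on spreadsheet-like rows. It is an illustration only, not the Data Extraction Service itself: the field names, label patterns, value formats, and the `extract_records` function are all hypothetical, and the "context" rule (a label cell immediately to the left of a value cell) is an assumption made for the example.

```python
import re

# Hypothetical keyword patterns: each standardized field name maps to the
# label variants that might appear beside a value in a spreadsheet.
KEYWORD_PATTERNS = {
    "temperature_c": re.compile(r"^\s*temp(erature)?\b", re.IGNORECASE),
    "batch_id": re.compile(r"^\s*(batch|lot)\b", re.IGNORECASE),
}

# Hypothetical value formats: a value is kept only if it matches the
# format specified for its field.
VALUE_FORMATS = {
    "temperature_c": re.compile(r"^-?\d+(\.\d+)?$"),
    "batch_id": re.compile(r"^[A-Z]{2}-\d{4}$"),
}


def extract_records(rows):
    """Return structured records found in a list of rows (lists of cells).

    Assumed context rule: a value is meaningful when the cell to its left
    is a recognized label and the value matches that field's format.
    """
    records = []
    for row_index, row in enumerate(rows):
        for col_index in range(len(row) - 1):
            label = str(row[col_index])
            value = str(row[col_index + 1]).strip()
            for field, label_pattern in KEYWORD_PATTERNS.items():
                if label_pattern.search(label) and VALUE_FORMATS[field].match(value):
                    records.append({"field": field, "value": value, "row": row_index})
    return records


if __name__ == "__main__":
    sample_rows = [
        ["Notes", "ran late"],
        ["Temp (C)", "72.5"],
        ["Lot No.", "AB-1234"],
        ["Temperature", "n/a"],  # label matches, value fails the format
    ]
    for record in extract_records(sample_rows):
        print(record)
```

In this sketch, differently worded labels such as "Temp (C)" and "Temperature" map to one standardized field name, and cells that fail the expected format are skipped rather than guessed at.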

In our webinar, I’ll present a scenario in which we used the Data Extraction Service to extract and structure historical data from an enormous number of spreadsheets, as an example data source, using JMP® software in the extraction process.

Q: How do you even begin to tackle the investigation of a vast amount of dormant data?

FM: In one example project, we began by working with our client to explore and identify specific formats and contexts to help in evaluating their unstructured data. The overall approach was to work alongside our client in an iterative process with logical stages to ensure that the solution was achieved through continuous collaboration.

Meaningful, structured data is extracted from unstructured data in a spreadsheet

We dove into a huge number of data files (hundreds of thousands, in fact) to gather a representative subset of data formats from the client’s entire inventory of dormant data. The exploration was exhaustive, but it paved the way for preparing meaningful data to be extracted and structured to our client’s exacting standards.

Q: Did you come to any key insights of your own during the development process of the Data Extraction Service?

FM: Learning about the business value of reactivating unstructured, historical data for present-day reuse was an eye-opening experience. Also, the extent to which the Data Extraction Service is able to process high volumes of unstructured data successfully is remarkable.

Farhan Mansoor is a Software Engineer at Predictum Inc. Connect with Predictum Inc. on Twitter and LinkedIn.

For more insights in this blog series, catch our first post, Dormant Data Equals Lost Opportunities, and our third post, The Recipe for Success in Research and Experimentation.

