For most organizations, data can be classified as being either structured or unstructured. Structured data is essentially data that exists within a database. This data has been carefully curated into rows and columns, each containing a specific type of data. Unstructured data, on the other hand, often consists of file data. The Microsoft Word document I used to write this blog post, for example, is unstructured data.
Of course, in the real world, the lines between data types are sometimes blurred. A collection of Microsoft Word documents, while unstructured, might contain data that could conceivably be extracted from the documents and turned into a structured data set.
Historically, extracting data from documents and adding that data to a database has been both complicated and expensive. However, Microsoft has developed a SharePoint Online add-on called Syntex that uses machine learning to extract data from documents and then place that data into rows and columns.
SharePoint Syntex: An easier way to extract data
SharePoint Syntex works with a wide variety of document types, but you will only be able to successfully extract data from your documents if you choose the data processing model type that is best suited to the data that you are working with.
At the beginning of this post, I mentioned that data could usually be classified as either structured or unstructured. However, there is a bit of a gray area between structured and unstructured data. Semi-structured data, for example, could refer to a collection of unstructured data objects that each store data in a uniform way. Imagine, for example, that a consulting company has a form that they send to all of their new clients to gather information about the client’s goals, priorities, and budget.
Because this consulting firm sends out the exact same form to each client, a collection of completed form documents could be considered semi-structured data. The data is unstructured in that it resides in a collection of individual document files. However, each of the document files stores client responses uniformly, in designated fields within the document.
Tax forms are another example of semi-structured data. If you send in your tax return on paper, then the taxing agency knows the location of each individual data field within the document because the document is based on a standardized form.
Although a collection of forms is a good example of semi-structured data, form data is not the only semi-structured data type. It’s also possible for a collection of documents to include similar data fields, even though the formatting is completely different from one document to the next.
Imagine for a moment that a particular business leverages the services of several different vendors and that for those vendors to get paid, they have to submit an invoice at the end of the month. Because each vendor is a unique business, no two vendors’ invoices will look the same. Although the vendors’ invoices are cosmetically different from one another and formatted differently, they all contain the same types of information. Each invoice might, for example, include a date, an account number, and a purchase order number. In other words, there are definable data fields within the invoices, but the invoices are not based on standardized forms.
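To make the invoice example concrete, here is a minimal Python sketch that pulls the same three fields out of two invoices with different layouts and turns each one into a row. It is only an illustration of what "same fields, different formatting" means. It uses hand-written regular expressions rather than machine learning, it is not how SharePoint Syntex works internally, and the sample invoices, field names, and patterns are all made up for this example.

```python
import re

# Hypothetical patterns for the three fields mentioned above. Real invoices
# vary far more than this, which is why a trained model is the better tool.
FIELD_PATTERNS = {
    "date": re.compile(
        r"\b(?:Invoice\s+)?Date\b\s*[:\-]?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE
    ),
    "account_number": re.compile(
        r"\b(?:Account|Acct)\b\.?\s*(?:Number|No\.?|#)?\s*[:\-]?\s*(\d+)",
        re.IGNORECASE,
    ),
    "po_number": re.compile(
        r"\b(?:Purchase\s+Order|PO)\b\s*(?:Number|No\.?|#)?\s*[:\-]?\s*([A-Z0-9\-]+)",
        re.IGNORECASE,
    ),
}


def extract_fields(text):
    """Return one row (a dict) holding whichever fields were found in the text."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else None
    return row


# Two made-up invoices: same kinds of information, different layouts.
invoice_a = """ACME SUPPLY CO.
Invoice Date: 2021-03-31
Account Number: 448812
Purchase Order: PO-7731
"""

invoice_b = """Northwind Services | Acct # 90210 | PO No. 5520-B
Date 2021-03-30
Total due: $1,250.00
"""

rows = [extract_fields(invoice_a), extract_fields(invoice_b)]
for row in rows:
    print(row)
```

Each dictionary in `rows` has the same keys, so the two cosmetically different invoices end up as two rows in the same table. Patterns like these break as soon as a vendor changes its wording, which is the problem a trained extraction model is meant to address.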
Two data processing models
SharePoint Syntex offers two different data processing models that can be used when extracting data from documents. The first of these models is a forms processing model. Not surprisingly, the forms processing model is used when each document that needs to be processed is based on a standardized form or adheres to a standard format. For this model to work, SharePoint needs to know where the data resides within the document.
The forms processing model can be used on documents in PDF, JPG, or PNG format. Although Office documents are conspicuously absent from the list of supported file types, the forms processing model can handle text-based PDFs, or it can perform optical character recognition on image-based PDF files, JPG files, or PNG files.
The other type of data processing model that is available within SharePoint Syntex is the document understanding model. This is the model you would use for documents that have dissimilar layouts but contain standardized types of information.
The document understanding model supports PDF files, Microsoft Office documents, and email messages. Unlike the forms processing model, however, it does not support JPG or PNG files, even though it does allow for the use of optical character recognition.
One of the biggest differences between the forms processing model and the document understanding model is how the two models are set up. The forms processing model is created in AI Builder and relies on a classifier to tell the system where the various data fields reside within the form. Incidentally, this type of model can only be applied to a single document library.
The document understanding model works completely differently. Rather than creating the model in AI Builder, you can create a document understanding model in a special purpose SharePoint site known as the Content Center. Once created, the model can be applied to multiple SharePoint document libraries.
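The differences between the two models can be summarized as a small decision helper. The Python sketch below is not a SharePoint or Syntex API. The function name, the constants, and the mapping from file categories to file extensions (for example, treating .docx, .xlsx, and .pptx as "Office documents" and .msg and .eml as "email messages") are assumptions made for illustration only.

```python
from pathlib import Path

# File types per model, as described in this post. The specific extensions
# chosen for "Office documents" and "email messages" are assumptions.
FORMS_PROCESSING_TYPES = {".pdf", ".jpg", ".png"}
DOCUMENT_UNDERSTANDING_TYPES = {".pdf", ".docx", ".xlsx", ".pptx", ".msg", ".eml"}


def recommend_model(filename, standardized_layout):
    """Suggest a model name for one document, or None if there is no clear fit.

    standardized_layout: True when every document follows the same form,
    False when the layout varies but the kinds of fields are the same.
    """
    suffix = Path(filename).suffix.lower()
    if standardized_layout and suffix in FORMS_PROCESSING_TYPES:
        return "forms processing"
    if not standardized_layout and suffix in DOCUMENT_UNDERSTANDING_TYPES:
        return "document understanding"
    # The post does not say what to do in the remaining cases (for example,
    # a standardized form saved as a Word document), so this sketch returns
    # None rather than guessing.
    return None


print(recommend_model("tax_form.png", standardized_layout=True))
print(recommend_model("vendor_invoice.docx", standardized_layout=False))
print(recommend_model("scanned_invoice.jpg", standardized_layout=False))
```

The first two calls map cleanly onto one model each. The third returns `None` because a JPG with a non-standard layout falls outside the file types listed for the document understanding model.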
Because there is no way to know where the data will reside within a given document, SharePoint Syntex uses machine learning to train the model. When creating a document understanding model, you will typically have to provide five to 10 sample documents to train the model. You will also need to provide some negative training samples that can help SharePoint differentiate between a document containing data and a document that was placed in a library by mistake and does not contain extractable data.
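Before sitting down to train a model, it can help to check that you have gathered enough sample files. The short Python sketch below does only that. It does not talk to SharePoint, the function name is hypothetical, the default of five positive samples comes from the low end of the typical range mentioned above, and the default of one negative sample is an assumption, since "some" is not a precise number.

```python
def check_training_set(positive_samples, negative_samples,
                       min_positive=5, min_negative=1):
    """Return a list of problems with the sample set (empty if none found)."""
    problems = []
    if len(positive_samples) < min_positive:
        problems.append(
            f"need at least {min_positive} positive samples, "
            f"got {len(positive_samples)}"
        )
    if len(negative_samples) < min_negative:
        problems.append(
            f"need at least {min_negative} negative samples, "
            f"got {len(negative_samples)}"
        )
    return problems


# Made-up file names for illustration.
positives = ["invoice_01.pdf", "invoice_02.pdf", "invoice_03.docx"]
negatives = []

for problem in check_training_set(positives, negatives):
    print(problem)
```

With three positive samples and no negative ones, the check reports both shortfalls.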
Use the model best suited for your purposes
Both models do a good job of extracting data from document files, but it is important to use the model that is best suited to the document contents. If given a choice, it is also better to use text-based documents rather than graphical ones. Doing so eliminates the need for optical character recognition and can, therefore, potentially reduce mistakes.