Document Image Classification (DIC): identifies the type of document present in an image file or bitmap, whether scanned, handwritten, typed, or printed. The DIC model determines whether the image contains a document and, if so, what kind of document it is (receipt, invoice, personal ID, etc.). If the image does not contain a valid document, the DIC model classifies it as fake.
Optical Character Recognition (OCR): recognizes characters in an image or bitmap file and outputs them as machine-editable text.
Document Field Classification (DFC): extracts values for pre-established fields from lists of raw text (for example, text extracted by the OCR module). Examples of these fields include dates, monetary values, and ID numbers. DFC uses predefined regex patterns or predefined keywords to identify the relevant fields in the text. It also uses fuzzy logic to handle misspelled words.
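The DFC approach described above (a regex for the value, keywords that label it, and tolerance for misspellings) can be illustrated with a minimal Python sketch. This is not the IRIS implementation: the field names, patterns, keywords, and the `extract_fields` helper are all hypothetical, and approximate string matching from the standard `difflib` module stands in for the fuzzy logic.

```python
import re
from difflib import get_close_matches

# Hypothetical field definitions: a regex for the value and keywords that label it.
FIELDS = {
    "date": {"pattern": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "keywords": ["date", "issued"]},
    "total": {"pattern": re.compile(r"\b\d+\.\d{2}\b"), "keywords": ["total", "amount"]},
}

def has_keyword(line, keywords, cutoff=0.8):
    """Return True if any word in the line approximately matches a keyword."""
    words = re.findall(r"[a-z]+", line.lower())
    return any(get_close_matches(w, keywords, n=1, cutoff=cutoff) for w in words)

def extract_fields(lines):
    """Return the first value found for each field on a line labelled by its keyword."""
    found = {}
    for line in lines:
        for name, spec in FIELDS.items():
            if name in found or not has_keyword(line, spec["keywords"]):
                continue
            match = spec["pattern"].search(line)
            if match:
                found[name] = match.group()
    return found

# OCR output with misspelled labels ("Dat", "Totl") still yields both fields.
ocr_lines = ["Invoce Dat: 12/03/2021", "Totl amount 149.90"]
fields = extract_fields(ocr_lines)
```

Requiring a keyword on the same line keeps a bare number such as a phone extension from being mistaken for a monetary value; the `cutoff` parameter controls how much misspelling is tolerated.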
For instance, a telecom company may want to update its customer database using a chatbot to collect and validate the required information. This information may include proof of address (for example, from an image of a utility bill) and personal ID (for example, from an image of a driver's license). To extract the appropriate values from the images, the following configuration would be necessary:
For DIC: a data set with at least 100 sample images for each category (for example, utility bill and driver's license), which is used to train the DIC model. By default, the DIC model is also trained on diverse images that do not contain documents; these samples establish the fake category. After training, the model is ready for use.
For DFC: the regex patterns for each field to be extracted. For instance, to extract a CNPJ (the Brazilian national registry number for companies), which follows the pattern XX.XXX.XXX/XXXX-XX (where X is a digit from 0 to 9), a regex matching that exact layout is needed.
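A pattern for the CNPJ layout described above might look like the following sketch. It is an assumption derived from the stated format XX.XXX.XXX/XXXX-XX, not necessarily the expression IRIS uses, and the `find_cnpj` helper is hypothetical. It checks the format only and does not validate the CNPJ check digits.

```python
import re

# Two digits, dot, three digits, dot, three digits, slash, four digits, hyphen, two digits.
CNPJ_PATTERN = re.compile(r"\b\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}\b")

def find_cnpj(lines):
    """Return every CNPJ-formatted string found in the given lines of text."""
    return [m for line in lines for m in CNPJ_PATTERN.findall(line)]

matches = find_cnpj(["CNPJ: 12.345.678/0001-95", "no id here"])
```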
Note that IRIS was developed to support all three modules (DIC, OCR, and DFC) in order to perform image comprehension. However, you may only want to use one of those modules in your application. For instance, DIC can be used separately in order to determine if an image contains the required document, without the need to extract text and classify fields. Similarly, you might just want to extract all text present in the images (OCR), without needing to classify the image and fields. For this reason, you can configure your application to access any of those modules individually.
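The idea of using the modules together or individually can be sketched as a pipeline whose stages are optional. Everything here is hypothetical (the `ImagePipeline` class, its attribute names, and the stand-in lambdas); it illustrates the composition only and does not reflect the actual IRIS API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class ImagePipeline:
    # Each stage is optional, so an application can plug in only what it needs.
    classify: Optional[Callable[[bytes], str]] = None                       # DIC
    read_text: Optional[Callable[[bytes], List[str]]] = None                # OCR
    extract_fields: Optional[Callable[[List[str]], Dict[str, str]]] = None  # DFC

    def run(self, image: bytes) -> dict:
        result = {}
        if self.classify is not None:
            result["document_type"] = self.classify(image)
            if result["document_type"] == "fake":
                return result  # no valid document, so skip text extraction
        if self.read_text is not None:
            result["lines"] = self.read_text(image)
            if self.extract_fields is not None:
                result["fields"] = self.extract_fields(result["lines"])
        return result

# DIC only: check whether the image holds a valid document.
dic_only = ImagePipeline(classify=lambda image: "fake")
# OCR only: extract text without classifying the image or its fields.
ocr_only = ImagePipeline(read_text=lambda image: ["line one", "line two"])
```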
To learn more or to get started with IRIS, contact Sinch or, if you are an existing customer, reach out to your Sinch Account Manager.