Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents, going beyond traditional optical character recognition (OCR). It can identify, understand, and extract data from tables and forms with high accuracy. Today, many companies rely on manual extraction methods or basic OCR software, which is tedious and time-consuming and requires manual configuration that must be updated whenever a form changes. Amazon Textract helps solve these challenges by using ML to automatically process different document types and accurately extract information with minimal manual intervention. This enables you to automate document processing and use the extracted data for different purposes, such as automating loan processing or gathering information from invoices and receipts.
As travel resumes post-pandemic, verifying a traveler’s vaccination status may be required in many cases. Hotels and travel agencies often need to review vaccination cards to gather important details like whether the traveler is fully vaccinated, vaccine dates, and the traveler’s name. Some agencies do this through manual verification of cards, which can be time-consuming for staff and leaves room for human error. Others have built custom solutions, but these can be costly and difficult to scale, and take significant time to implement. Moving forward, there may be opportunities to streamline the vaccination status verification process in a way that is efficient for businesses while respecting travelers’ privacy and convenience.
Amazon Textract Queries helps address these challenges. With Amazon Textract Queries, you specify the pieces of information you need from a document as natural language questions, and Amazon Textract returns only the answers to those questions rather than the full extracted text.
In this post, we walk you through a step-by-step implementation guide to build a vaccination status verification solution using Amazon Textract Queries. The solution showcases how to process vaccination cards using an Amazon Textract query, verify the vaccination status, and store the information for future use.
Solution overview
The following diagram illustrates the solution architecture.
The workflow includes the following steps:
- The user takes a photo of a vaccination card.
- The image is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket.
- When the image is saved in the S3 bucket, it invokes an AWS Step Functions workflow:
  - The Queries-Decider AWS Lambda function examines the document passed in and adds information about the mime type, the number of pages, and the number of queries to the Step Functions workflow (for our example, we have four queries).
  - NumberQueriesAndPagesChoice is a Choice state that adds conditional logic to the workflow. If the document has more than 15 queries (up to 30) or more than one page (up to 3,000), Amazon Textract asynchronous processing is the only option, because the synchronous APIs only support up to 15 queries and one-page documents. In all other cases, the workflow randomly selects synchronous or asynchronous processing.
  - The TextractSync Lambda function sends a request to Amazon Textract to analyze the document based on the following Amazon Textract queries:
    - What is Vaccination Status?
    - What is Name?
    - What is Date of Birth?
    - What is Document Number?
- Amazon Textract analyzes the image and sends the answers to these queries back to the Lambda function.
- The Lambda function verifies the customer’s vaccination status and stores the final result in CSV format in the same S3 bucket (demoqueries-textractxxx) in the csv-output folder.
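The routing decision made by the Choice state can be sketched in a few lines of Python. This is only an illustration, not the deployed state machine definition: the function names are hypothetical, and it assumes that exceeding either synchronous limit (more than 15 queries, or more than one page) is enough to require asynchronous processing.

```python
import random

# Synchronous limits as stated in this post.
SYNC_MAX_QUERIES = 15
SYNC_MAX_PAGES = 1


def must_use_async(number_of_pages: int, number_of_queries: int) -> bool:
    """Return True when the document exceeds either synchronous limit."""
    return number_of_pages > SYNC_MAX_PAGES or number_of_queries > SYNC_MAX_QUERIES


def choose_processing_mode(number_of_pages: int, number_of_queries: int, rng=random) -> str:
    """Mimic the Choice state: force 'async' when required, otherwise pick at random.

    `rng` is injectable so the random branch can be made deterministic in tests.
    """
    if must_use_async(number_of_pages, number_of_queries):
        return "async"
    return rng.choice(["sync", "async"])
```

For the vaccination card in this post (one page, four queries), `must_use_async(1, 4)` is `False`, so either path may be taken.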
Prerequisites
To complete this solution, you should have an AWS account and the appropriate permissions to create the resources required as part of the solution.
Download the deployment code and sample vaccination card from GitHub.
Use the Queries feature on the Amazon Textract console
Before you build the vaccination verification solution, let’s explore how you can use Amazon Textract Queries to extract vaccination status via the Amazon Textract console. You can use the vaccination card sample you downloaded from the GitHub repo.
- On the Amazon Textract console, choose Analyze Document in the navigation pane.
- Under Upload document, choose Choose document to upload the vaccination card from your local drive.
- After you upload the document, select Queries in the Configure Document section.
- You can then add queries in the form of natural language questions. Let’s add the following:
- What is Vaccination Status?
- What is Name?
- What is Date of Birth?
- What is Document Number?
- After you add all your queries, choose Apply configuration.
- Check the Queries tab to see the answers to the questions.
You can see Amazon Textract extracts the answer to your query from the document.
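The same four questions can also be sent programmatically. The following sketch builds the request parameters for the AnalyzeDocument API with the QUERIES feature type and reads the answers out of the response by following each QUERY block's ANSWER relationship to its QUERY_RESULT block. The aliases, the helper names, and the bucket and key values are assumptions made for illustration; they are not taken from the solution's code.

```python
from typing import Any, Dict, List

# The four questions from this post; the aliases are hypothetical labels.
QUERIES: List[Dict[str, str]] = [
    {"Text": "What is Vaccination Status?", "Alias": "VACCINATION_STATUS"},
    {"Text": "What is Name?", "Alias": "NAME"},
    {"Text": "What is Date of Birth?", "Alias": "DATE_OF_BIRTH"},
    {"Text": "What is Document Number?", "Alias": "DOCUMENT_NUMBER"},
]


def build_request(bucket: str, key: str, queries: List[Dict[str, str]]) -> Dict[str, Any]:
    """Build keyword arguments for an AnalyzeDocument call using the QUERIES feature."""
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {"Queries": queries},
    }


def extract_answers(response: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Map each query alias (or its text) to the answer text and confidence."""
    blocks = response.get("Blocks", [])
    blocks_by_id = {block["Id"]: block for block in blocks}
    answers: Dict[str, Dict[str, Any]] = {}
    for block in blocks:
        if block["BlockType"] != "QUERY":
            continue
        label = block["Query"].get("Alias") or block["Query"]["Text"]
        for relationship in block.get("Relationships", []):
            if relationship["Type"] != "ANSWER":
                continue
            for answer_id in relationship["Ids"]:
                answer = blocks_by_id.get(answer_id)
                if answer is not None:
                    answers[label] = {
                        "text": answer.get("Text", ""),
                        "confidence": answer.get("Confidence"),
                    }
    return answers
```

With boto3 available, the request could be sent as `boto3.client("textract").analyze_document(**build_request(bucket, key, QUERIES))` and the response passed to `extract_answers`. Queries that Amazon Textract cannot answer have no ANSWER relationship and are simply absent from the result.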
Deploy the vaccination verification solution
In this post, we use an AWS Cloud9 instance and install the necessary dependencies on the instance with the AWS Cloud Development Kit (AWS CDK) and Docker. AWS Cloud9 is a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code with just a browser.
- In the terminal, choose Upload Local Files on the File menu.
- Choose Select folder and choose the vaccination_verification_solution folder you downloaded from GitHub.
- In the terminal, prepare your serverless application for subsequent steps in your development workflow in AWS Serverless Application Model (AWS SAM) using the following command:
- Deploy the application using the cdk deploy command:
  Wait for the AWS CDK to deploy the model and create the resources mentioned in the template.
- When deployment is complete, you can check the deployed resources on the AWS CloudFormation console on the Resources tab of the stack details page.
Test the solution
Now it’s time to test the solution. To trigger the workflow, use aws s3 cp to upload the vac_card.jpg file to DemoQueries.DocumentUploadLocation inside the docs folder:
The vaccination certificate file automatically gets uploaded to the S3 bucket demoqueries-textractxxx in the uploads folder.
The Step Functions workflow is triggered via a Lambda function as soon as the vaccination certificate file is uploaded to the S3 bucket.
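As a rough sketch of this trigger, a Lambda function can read the bucket and object key from the S3 event and start one state machine execution per uploaded file. The payload field names (`s3Path`, `queries`) and the function name are hypothetical and are not taken from the solution's start_execution.py; the Step Functions client is passed in as a parameter so the logic can be exercised without AWS access.

```python
import json
from urllib.parse import unquote_plus


def start_executions(event, sfn_client, state_machine_arn, queries):
    """Start one Step Functions execution for each object in an S3 event.

    `sfn_client` is expected to behave like boto3.client("stepfunctions").
    Returns the list of S3 paths for which an execution was started.
    """
    started = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        payload = {"s3Path": f"s3://{bucket}/{key}", "queries": queries}
        sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            input=json.dumps(payload),
        )
        started.append(payload["s3Path"])
    return started
```

In a real handler, `sfn_client` would be created once outside the handler with `boto3.client("stepfunctions")`, and the state machine ARN would typically come from an environment variable.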
The Queries-Decider Lambda function examines the document and adds information about the mime type, the number of pages, and the number of queries to the Step Functions workflow (for this example, we use four queries—document number, customer name, date of birth, and vaccination status).
The TextractSync function sends the input queries to Amazon Textract and synchronously returns the full result as part of the response. It supports one-page documents (TIFF, PDF, JPG, PNG) and up to 15 queries. The GenerateCsvTask function takes the JSON output from Amazon Textract and converts it to a CSV file.
The final output is stored in the same S3 bucket in the csv-output folder as a CSV file.
You can download the file to your local machine using the following command:
The format of the result is timestamp, classification, filename, page number, key name, key_confidence, value, value_confidence, key_bb_top, key_bb_height, key_bb_width, key_bb_left, value_bb_top, value_bb_height, value_bb_width, value_bb_left.
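To make the column layout concrete, the following sketch writes query answers as CSV rows with these 16 columns. It is not the GenerateCsvTask implementation: the input shape, the function names, the underscore-normalized column names (`page_number`, `key_name`, `key_bb_width`), and the choice to omit a header row are all assumptions. Bounding box values are read from dictionaries shaped like the `Geometry.BoundingBox` object in an Amazon Textract response, and missing values are left empty.

```python
import csv
import io
from datetime import datetime, timezone

COLUMNS = [
    "timestamp", "classification", "filename", "page_number", "key_name",
    "key_confidence", "value", "value_confidence",
    "key_bb_top", "key_bb_height", "key_bb_width", "key_bb_left",
    "value_bb_top", "value_bb_height", "value_bb_width", "value_bb_left",
]


def _bbox_fields(prefix, bbox):
    """Flatten a Textract-style bounding box into prefixed CSV fields."""
    bbox = bbox or {}
    return {
        f"{prefix}_bb_top": bbox.get("Top", ""),
        f"{prefix}_bb_height": bbox.get("Height", ""),
        f"{prefix}_bb_width": bbox.get("Width", ""),
        f"{prefix}_bb_left": bbox.get("Left", ""),
    }


def answers_to_csv(answers, filename, classification="", page_number=1, timestamp=None):
    """Render a list of answer dicts as CSV text, one row per answer."""
    timestamp = timestamp or datetime.now(timezone.utc).isoformat()
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    for answer in answers:
        row = {
            "timestamp": timestamp,
            "classification": classification,
            "filename": filename,
            "page_number": page_number,
            "key_name": answer["key"],
            "key_confidence": answer.get("key_confidence", ""),
            "value": answer["value"],
            "value_confidence": answer.get("value_confidence", ""),
        }
        row.update(_bbox_fields("key", answer.get("key_bb")))
        row.update(_bbox_fields("value", answer.get("value_bb")))
        writer.writerow(row)
    return buffer.getvalue()
```

Using `csv.DictWriter` keeps the column order fixed by `COLUMNS` and handles quoting of values that contain commas, such as names written as "Last, First".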
You can scale the solution to hundreds of vaccination certificate documents for multiple customers by uploading their vaccination certificates to DemoQueries.DocumentUploadLocation. This automatically triggers multiple runs of the Step Functions state machine, and the final result is stored in the same S3 bucket in the csv-output folder.
To change the initial set of queries that are fed into Amazon Textract, you can go to your AWS Cloud9 instance and open the start_execution.py file. In the file view in the left pane, navigate to lambda, start_queries, app, start_execution.py. This Lambda function is invoked when a file is uploaded to DemoQueries.DocumentUploadLocation. The queries sent to the workflow are defined in start_execution.py; you can change those by updating the code as shown in the following screenshot.
Clean up
To avoid incurring ongoing charges, delete the resources created in this post using the following command:
Answer the question Are you sure you want to delete: DemoQueries (y/n)? with y.
Conclusion
In this post, we showed you how to use Amazon Textract Queries to build a vaccination verification solution for the travel industry. You can use Amazon Textract Queries to build solutions in other industries like finance and healthcare, and retrieve information from documents such as paystubs, mortgage notes, and insurance cards based on natural language questions.
For more information, see Analyzing Documents, or check out the Amazon Textract console and try out this feature.
About the Authors
Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.
Rishabh Yadav is a Partner Solutions Architect at AWS with an extensive background in DevOps and Security offerings at AWS. He works with ASEAN partners to provide guidance on enterprise cloud adoption and architecture reviews along with building AWS practices through the implementation of the Well-Architected Framework. Outside of work, he likes to spend his time on the sports field and FPS gaming.