
Document Management System using NOSQL

Technical White Paper

Author Prem Kumar Unnikrishnan

Date Of Creation 8/1/2017

Revision Date

Revision Number 1.0

File Name Document Management System using NOSQL


Table of Contents

1. INTRODUCTION

1.1 STORAGE REPOSITORIES

1.2 METADATA

1.3 VALIDATION

1.4 SECURITY AND ACCESS CONTROL

1.5 INDEXING AND SEARCH

1.6 VERSION CONTROL


2. DOCUMENT MANAGEMENT SYSTEM USER BENEFITS

2.1 EASY ACCESS

2.2 EASY TO SEARCH

2.3 BETTER COLLABORATION

2.4 ADDED SECURITY

2.5 SAVES SPACE

2.6 DISASTER RECOVERY


3. ARCHITECTURE


4. SERVICE

4.1 CREATING A SERVICE

4.2 METHODS EXPOSED

4.2.1 Check-In File

4.2.2 Retrieve files

4.2.3 Update files

4.3 RESPONSE DATA

5. SECURITY CONFIGURATIONS


6. CONFIGURATIONS

6.1 REPOSITORY CONFIGURATION

6.1.1 DocumentType

6.1.2 DocumentIdPrefix

6.1.3 Permissions

6.1.4 Repository Name

6.1.5 Class Name


7. DATABASE

7.1 FILE STORAGE

7.2 METADATA ARRANGEMENT

7.3 INDEXING SERVER

7.3.1 Unique Index

7.3.2 Common Index

7.3.3 Reverse Index


1. Introduction

A Document Management System (DMS) is a software system used to store and manage electronic documents and images of paper-based information. It is a challenge for companies to keep up with all the paperwork and electronic files that come into the office every day. It generally starts slowly: an email here, a receipt there, incoming invoices and customer letters. Before you know it, you have a mountain of paper and no way to find the documents you need. This is where a DMS becomes important: it organizes all digital files generated throughout an organization in a central location using a defined process.


A DMS mainly incorporates:

  1. Storage repositories

  2. Metadata

  3. Validation

  4. Security and access control

  5. Indexing and Search

  6. Check-in systems

  7. Information retrieval systems

  8. Version control


1.1 Storage Repositories

An organization may hold a wide variety of documents, such as receipt notices, invoices, letters, approvals, requests, information, and emails. It is better to keep a separate repository for each kind of document, so that each can be maintained and accessed effectively.


1.2 Metadata

Metadata is key to a document management system. Storing a few pieces of related data alongside each document in a repository helps end users identify documents easily, because a document-related keyword is much easier to remember than the actual document name.


1.3 Validation

When storing various types of documents, a repository may need to refuse documents that lack certain metadata values. Likewise, some mandatory key values may be required to search for a document in a repository. In such cases, the DMS should reject the request or inform the calling system that it must pass the appropriate metadata values to store or retrieve documents in that repository.
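The validation rule above can be illustrated with a minimal Python sketch. The document types, key names, and the `REQUIRED_KEYS` table are hypothetical and stand in for whatever rules a real repository would configure; this is not the service's own implementation.

```python
# Hypothetical rules: the metadata keys each document type must carry.
REQUIRED_KEYS = {
    "Invoice": {"DocumentType", "InvoiceNumber", "CustomerName"},
    "Letter": {"DocumentType", "Sender"},
}


def missing_metadata(document_type, metadata):
    """Return the sorted mandatory keys that are absent or empty in metadata."""
    required = REQUIRED_KEYS.get(document_type, {"DocumentType"})
    present = {key for key, value in metadata.items() if value not in (None, "")}
    return sorted(required - present)


def validate_or_raise(document_type, metadata):
    """Reject a request and tell the caller which values it must pass."""
    missing = missing_metadata(document_type, metadata)
    if missing:
        raise ValueError("Missing mandatory metadata: " + ", ".join(missing))
```

A caller would run `validate_or_raise` before storing or searching, so the consuming system receives the list of keys it still has to supply.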


1.4 Security and access control

Documents are checked in and retrieved by many systems. Each type of document may be accessible to a different set of users, and an organization may want to restrict each document repository to the relevant users only. The DMS therefore has to be designed with security in mind, allowing each group of users to access only its own business data.


1.5 Indexing and Search

Since an organization can have millions of documents stored in a repository, retrieving documents from storage is not an easy task. Through search engines, document management systems allow quick access to any document or file. This is achieved by indexing the metadata stored along with the documents. Some metadata fields have a distinct value for each document, while others have values shared by many documents. Depending on the need, different types of indexes should be built to produce the most effective search results.


1.6 Version control

Versioning is a process by which documents are checked in or out of the document

management system, allowing users to retrieve previous versions and to continue work

from a selected point. Versioning is useful for documents that change over time and

require updating, but it may be necessary to go back to or reference a previous copy.


2. Document Management System User Benefits

2.1 Easy access

Having such software in place means users do not need to stockpile files on their desks. Instead, they can use search to query a common repository and find the documents they need.


2.2 Easy to search

Documents can be found by searching on the keywords stored as their metadata.


2.3 Better collaboration

Getting rid of hard copies of documents makes it easy for employees to work with each

other.


2.4 Added security

Increased security is a huge advantage of using document management solutions.


2.5 Saves space

Exchanging paper documents for digital versions can save a tremendous amount of physical

space.


2.6 Disaster recovery

Businesses that have all of their documents stored in physical filing cabinets face the risk

that these papers might be destroyed, or that the business may lose access to them should

a disaster occur. A common storage repository combined with regular data backups helps an

organization safeguard its data more effectively.


3. Architecture



3.1 External Systems

An organization has multiple systems that generate documents or scan images as part of their day-to-day business activities. These systems need an easy and secure way to store and access these documents as needed.


3.2 Security Layer

Requests from multiple systems need to be authenticated to ensure secure communication between the client and the database. This layer secures the service endpoints by adding security controls that authenticate clients and relay requests and responses between the client and the DMS service.


3.3 Load Balancing

Load balancers ensure reliability and availability by monitoring the "health" of applications and

only sending requests to servers and applications that can respond in a timely manner.
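As a rough illustration of health-aware routing, the Python sketch below cycles through servers in round-robin order and skips any that are marked unhealthy. The class name, the round-robin policy, and the way health is reported are assumptions made for this example; the paper does not specify which load balancer or policy is used.

```python
class HealthAwareBalancer:
    """Round-robin over servers, skipping those marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0

    def mark(self, server, is_healthy):
        """Record the result of a (hypothetical) health check."""
        if is_healthy:
            self.healthy.add(server)
        else:
            self.healthy.discard(server)

    def pick(self):
        """Return the next healthy server, or fail if none can respond."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")
```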


3.4 Service Layer

A Windows Communication Foundation (WCF) service is used as the medium for communicating with the centralized repository. Since an organization may use different technologies, any of its systems may need to store or access documents in the centralized repository, so the DMS has to provide a common access protocol. A WCF service provides endpoints that client applications use to communicate with it.


3.5 Data Layer

In our design, we have chosen to store the documents and their metadata in a schema-less database. It is designed to be flexible enough to handle unstructured data and to scale to very large data volumes. The data in the database is held on a Network Attached Storage (NAS) device, which allows authorized network users to store and retrieve data from a centralized location.


4. Implementation

4.1 DMS Service

The fundamental purpose of this web service is to control how documents are exposed

to the external applications and how client applications can interact with that functionality.

Interoperability is an important part of this, as is the ability to safely send requests through

firewalls. In its simplest form a web service receives SOAP messages over HTTP that target

specific operations exposed by the service boundary.


As shown in the above diagram, the documents checked in by different systems via the service are stored separately as documents and metadata. The service splits the document from the incoming request and sends it to its respective file storage path. Next, the corresponding metadata of the document is captured by the service and sent to the DB file in the index path.


The DMS service reads the metadata values from the request and, based on the DocumentType value, finds the corresponding storage path in the configuration in order to store the document on the file share. This document storage path is captured in the metadata file to identify the document during search.


Once a request is received, the service converts the metadata into JSON format and sends it to the DB file in the index path, which is configured for each document type in the configuration file. At this point the document and its metadata are stored in their respective locations.
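The check-in flow described in this section can be sketched in Python as follows. The configuration keys `storage_path` and `index_path`, the `DocumentPath` metadata key, and the one-JSON-object-per-line layout of the DB file are all assumptions made for illustration; the real service is a WCF service and may differ in every one of these details.

```python
import json
from pathlib import Path


def check_in(root, storage_config, file_name, file_data, metadata):
    """Store the document bytes, then append its metadata as one JSON line."""
    document_type = metadata["DocumentType"]
    paths = storage_config[document_type]  # hypothetical per-type config entry
    storage_dir = Path(root) / paths["storage_path"]
    index_file = Path(root) / paths["index_path"]
    storage_dir.mkdir(parents=True, exist_ok=True)
    index_file.parent.mkdir(parents=True, exist_ok=True)

    # 1. The document itself goes to the file storage path for its type.
    document_path = storage_dir / file_name
    document_path.write_bytes(file_data)

    # 2. The metadata, plus the storage path, goes to the DB file as JSON.
    record = dict(metadata, DocumentPath=str(document_path))
    with index_file.open("a", encoding="utf-8") as db_file:
        db_file.write(json.dumps(record) + "\n")
    return record
```

Keeping the storage path inside the metadata record is what lets a later search return the location of the matching document.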


To secure the web service, we chose Windows authentication over SSL, which is handled through DataPower.


4.2 Methods exposed

The basic operations performed in the DMS are exposed as methods that are consumed by other applications. As mentioned in the previous sections, the DMS is all about uploading documents along with their respective metadata, so a method should allow a consumer to send or receive the file path and metadata.


Metadata

Metadata is a combination of a key and its value. Since there can be multiple metadata values, each method is designed to accept the metadata as a list of key-value pairs.


In addition, documents can be segregated by department or business through an additional method parameter. In our design it is called ApplicationID.


The metadata list always contains one mandatory parameter that identifies the correct repository under the corresponding ApplicationID; here we call it DocumentType. It is a mandatory parameter for all the methods exposed below.


Configurations

Repository Configuration:

This configuration file is the bridge between the web service request and the DMS repository. The properties below are defined in the config file to identify the repositories used to store or access the data.


Different repositories are created for different types of documents, and each is identified by the primary metadata value, named DocumentType. Some additional information is needed to reach the right repository for each type of document.


When consumers call the methods, the initial call is directed to the repository configuration file to identify the right values for processing the request (check-in, retrieve, or update).

4.2.1 DocumentType

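DocumentType is the vital parameter in the repository configuration for finding the configuration of the document to be uploaded or searched. When consumers call the methods, the initial call is directed to the repository configuration file, where DocumentType is used to look up these values.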


4.2.2 DocumentIdPrefix

Since we are going to have multiple repositories, plain sequential numbers would make it difficult to tell unique IDs apart. So a prefix is added in front of the unique number.

For example, the prefix DMSABC generates DocumentIds such as DMSABC-01, DMSABC-02, etc.
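The prefix scheme can be illustrated with a short Python sketch. The class name, the two-digit zero padding, and the in-memory counter are assumptions for this example; a real repository would need to persist the last number issued.

```python
from itertools import count


class DocumentIdGenerator:
    """Issue IDs of the form <prefix>-<number> for one repository."""

    def __init__(self, prefix, start=1, width=2):
        self.prefix = prefix
        self.width = width            # minimum number of digits, zero padded
        self._counter = count(start)  # in-memory only; not persisted

    def next_id(self):
        return f"{self.prefix}-{next(self._counter):0{self.width}d}"
```

Each repository would own one generator configured with its own DocumentIdPrefix, so IDs from different repositories never collide.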


4.2.3 Permissions

As described in the Security Configurations section, read and write access for each repository can be configured here, so that during the actual implementation of the methods this information can be retrieved from the repository configuration file and used to allow or restrict users.


4.2.4 Repository Name

Name of the repository where documents of the corresponding type are stored or searched.


4.2.5 Class Name

From the implementation perspective, each type of document may have a different set of metadata, validations, search parameters, and response values, depending on the business need. In such cases, separate classes are created to hold functionality such as response metadata, validation, and search queries. Since each type of document has its own behaviour in the implementation, the class name must be set in the configuration file so that the operations can locate the right class.
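One way to picture the Class Name setting is a lookup from the configured name to a handler class, sketched below in Python. The handler classes, the `REPOSITORY_CONFIG` dictionary, and the `ClassName` key are hypothetical names chosen for this example, not the configuration format of the actual service.

```python
class DefaultHandler:
    """Behaviour shared by document types with no special rules."""

    required_keys = ("DocumentType",)

    def validate(self, metadata):
        """Return the required keys that are missing or empty."""
        return [key for key in self.required_keys if not metadata.get(key)]


class PayCheckHandler(DefaultHandler):
    """Pay checks additionally require an EmployeeID."""

    required_keys = ("DocumentType", "EmployeeID")


# Registry of available handler classes, keyed by class name.
HANDLERS = {cls.__name__: cls for cls in (DefaultHandler, PayCheckHandler)}

# Hypothetical repository configuration: DocumentType -> settings.
REPOSITORY_CONFIG = {
    "PayCheck": {"ClassName": "PayCheckHandler"},
    "Letter": {"ClassName": "DefaultHandler"},
}


def handler_for(document_type):
    """Instantiate the class named in the configuration for this type."""
    class_name = REPOSITORY_CONFIG[document_type]["ClassName"]
    return HANDLERS[class_name]()
```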


When a request is made to access or store documents, the ApplicationID in the request body determines which repository the request is directed to.


4.2.6 Check-In File

The first operation in the DMS is checking a document into a repository. This needs a service method that uploads the document along with its metadata. First we look at uploading a document from a shared path. Assume the method is designed as below:


CheckInFile(string applicationId, string sourceFilePath, List<KeyValuePair> metaData, CheckInOptions checkInOptions)

Where,

applicationId = Value chosen based on the business need.

sourceFilePath = Path of the document to be uploaded.

metaData = Metadata values for the document.

checkInOptions = None or 1

  • None does not delete the source document from the shared path after the upload.

  • 1 deletes the source file from the shared path after the upload.
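The effect of the two checkInOptions values can be sketched in Python as below. The enum member names, the mapping of None to 0, and the `upload_from_shared_path` helper are assumptions for illustration; only the two behaviours (keep or delete the source file) come from the description above.

```python
import shutil
from enum import Enum
from pathlib import Path


class CheckInOptions(Enum):
    NONE = 0           # keep the source file on the shared path (value assumed)
    DELETE_SOURCE = 1  # delete the source file after upload (name assumed)


def upload_from_shared_path(source_file_path, repository_dir,
                            options=CheckInOptions.NONE):
    """Copy a file into the repository folder, optionally removing the source."""
    source = Path(source_file_path)
    target_dir = Path(repository_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)
    if options is CheckInOptions.DELETE_SOURCE:
        source.unlink()
    return target
```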


Another option is provided to upload the document as a byte array along with the metadata.


CheckInFileUsingPayload(string applicationId, string fileName, byte[] fileData, List<KeyValuePair> metaData)

Where,

applicationId = Value chosen based on the business need.

fileName = Name of the document to be uploaded.

fileData = Array of bytes extracted from the document to be uploaded.

metaData = Metadata values for the document.


If a string value needs to be uploaded as a file into the DMS repository along with metadata, the method below can be used. It is mostly useful for storing messages:


CheckInFileUsingText(string applicationId, string fileName, string fileData, List<KeyValuePair> metaData)

Where,

applicationId = Value chosen based on the business need.

fileName = Name of the document to be uploaded.

fileData = String value to be stored as the file content.

metaData = Metadata values for the document.


4.2.7 Retrieve files

When documents are stored in a document repository, they can also be retrieved globally, depending on how the system is set up and which users are granted access.


Retrieval of documents is also split across multiple methods, as shown below. The first method returns one or more results based on the search parameters passed; it accepts a list of metadata values to search for the relevant data.


GetFilesByMetaData(string applicationId, List<KeyValuePair> metaData)

Where,

applicationId = Value chosen based on the business need.

metaData = Combination of search metadata values based on the need.


In some scenarios, an organization expects only the topmost record to be returned for a combination of search parameters, or needs to find one particular document by its unique values. In such cases the method below can be used; it returns only one result.


GetFirstFileByMetaData(string applicationId, List<KeyValuePair> metaData)

Where,

applicationId = Value chosen based on the business need.

metaData = Combination of search metadata values based on the need.


To get the count of documents in a particular repository, the method below can be used. It also accepts a combination of metadata values, so the number of documents in the repository that match certain values can be determined.


For example, it can return the number of documents for employees born on a particular date.


GetFileCountByMetaData(string applicationId, List<KeyValuePair> metaData)

Where,

applicationId = Value chosen based on the business need.

metaData = Combination of search metadata values based on the need.
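The relationship between the three retrieval methods can be sketched in Python over an in-memory list of metadata records. The function names mirror the service methods, but the matching rule (every supplied key-value pair must match exactly) and the use of plain dictionaries are assumptions for this example; the real service searches its indexes rather than scanning a list.

```python
def get_files_by_metadata(records, criteria):
    """Return every record that matches all (key, value) pairs in criteria."""
    return [
        record for record in records
        if all(record.get(key) == value for key, value in criteria)
    ]


def get_first_file_by_metadata(records, criteria):
    """Return only the first matching record, or None if nothing matches."""
    matches = get_files_by_metadata(records, criteria)
    return matches[0] if matches else None


def get_file_count_by_metadata(records, criteria):
    """Return how many records match the criteria."""
    return len(get_files_by_metadata(records, criteria))
```

With records for several employees, `get_file_count_by_metadata(records, [("DOB", "1990-01-01")])` would give the number of documents for employees born on that date.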


4.2.8 Update files

In certain cases, consumers might need to update the uploaded documents or their metadata. For such scenarios, the DMS service provides an option to overwrite the files and to update the metadata.


4.2.8.1 Document update:

Documents can be updated using any of the Check-In file methods. Update the document locally and send it to the service with the same document name. Before checking in, the DMS service checks whether a document with the same name already exists in the repository; if it does, the existing document is overwritten, otherwise the document is uploaded as a new one.


Metadata is also overwritten along with the document update, so it is important to send the same metadata with the document if it has to be maintained. Otherwise, whatever new or null metadata values are passed to the method will replace the stored ones.
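The overwrite behaviour on check-in can be pictured with the Python sketch below, which keeps a repository as an in-memory dictionary keyed by document name. The function name, the return strings, and the dictionary layout are hypothetical; the point is only that both the content and the metadata are replaced wholesale when the name already exists.

```python
def check_in_or_override(repository, document_name, file_data, metadata):
    """Store a document by name, replacing content and metadata if it exists."""
    is_update = document_name in repository
    # Metadata is replaced, not merged: keys omitted by the caller are lost.
    repository[document_name] = {"data": file_data, "metadata": dict(metadata)}
    return "overridden" if is_update else "created"
```

Because the metadata is replaced rather than merged, a caller that wants to keep the old values has to send them again with the updated document.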


4.2.8.2 Metadata update:

If only the metadata needs updating, consumers can use the method below. The update is keyed on the unique metadata value called DocumentID, which must be sent with the update call to identify the document to update.


UpdateFileMetaData(string applicationId, string DocumentId, List<KeyValuePair> metaData)

Where,

applicationId = Value chosen based on the business need.

DocumentId = Unique ID of the document to be updated.

metaData = List of metadata values to be updated.
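A metadata-only update keyed on DocumentID might look like the Python sketch below. It assumes the supplied pairs are merged into the stored record (keys not mentioned are left unchanged) and that an unknown DocumentID is an error; the paper does not state either detail, so both are assumptions.

```python
def update_file_metadata(records, document_id, metadata):
    """Merge the given (key, value) pairs into the record with this DocumentID."""
    for record in records:
        if record.get("DocumentID") == document_id:
            record.update(dict(metadata))  # assumed merge, not full replacement
            return record
    raise KeyError(f"No document found with DocumentID {document_id!r}")
```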


4.3 Response data

Whenever a document is checked in or a retrieval is requested, the response XML has to be sent back in a fixed format. Consumers expect a known return data type from which to read and process the data. We have defined the response data in a class whose members consumers use to process the results as needed.


In our design, response data is exposed in the ResponseData class, whose members are:

  • DocumentName – Name of the document checked in, or the document name from the search result.

  • DocumentId – Unique ID of the document in the repository.

  • DocumentUrl – URL from which the document can be downloaded.

  • Metadata – List of metadata passed during the document upload.

Once the service is configured in the consumer application, consumers can access the response data using the above class.
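For readers who think in code, the shape of ResponseData can be mirrored in Python as a small data class. The snake_case field names and the use of a list of (key, value) tuples for the metadata are adaptations for this sketch; the actual class is part of the service contract and is returned to consumers as XML.

```python
from dataclasses import dataclass, field


@dataclass
class ResponseData:
    document_name: str  # name of the checked-in or retrieved document
    document_id: str    # unique ID of the document in the repository
    document_url: str   # URL from which the document can be downloaded
    metadata: list = field(default_factory=list)  # list of (key, value) pairs

    def metadata_value(self, key, default=None):
        """Convenience lookup of one metadata value by key."""
        return dict(self.metadata).get(key, default)
```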


5. Configuration

  1. Application Configuration

  2. Repository Configuration


5.1 Repository Configuration

Different repositories are created for different types of documents, and each is identified by the primary metadata value, named DocumentType. Some additional information is needed to reach the right repository for each type of document.


When consumers call the methods, the initial call is directed to the repository configuration file to identify the right values for processing the request (check-in, retrieve, or update).

6. Security Configurations

A DMS can have multiple repositories, each accessed by different external systems. Security is vital for each repository during the check-in or retrieval of documents.


Authorization is performed in each method of the service, so that access is verified before any operation on the repository is allowed. Since only two general operations need to be authorized, retrieval and check-in (which includes the update method), the service only has to verify access for these. During a check-in or update, write permission is verified before the operation is permitted. Similarly, for retrieval only users configured with read access are permitted.


In our design, read and write user access is configured in the repository configuration file, where all repository-level configuration is kept.
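The read and write checks described in this section can be sketched in Python as below. The `REPOSITORY_PERMISSIONS` table, the user names, and the operation labels are hypothetical; the only behaviour taken from the text is that check-in and update require write access while retrieval requires read access.

```python
# Hypothetical repository configuration: DocumentType -> allowed users.
REPOSITORY_PERMISSIONS = {
    "PayCheck": {"read": {"hr_portal", "payroll"}, "write": {"payroll"}},
}

# Check-in and update need write access; retrieval needs read access.
OPERATION_ACCESS = {"check-in": "write", "update": "write", "retrieve": "read"}


def is_authorized(user, document_type, operation):
    """Return True if the user may perform the operation on the repository."""
    access = OPERATION_ACCESS[operation]
    allowed = REPOSITORY_PERMISSIONS.get(document_type, {}).get(access, set())
    return user in allowed
```

Each service method would call a check of this kind first and refuse the request when it returns False.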


7. Database

Storage, where the documents are kept, is a vital part of the DMS design. In our design the storage is divided into two parts:

  1. Storage of the documents themselves.

  2. Metadata information required to retrieve the documents from the repository.



7.1 File storage

In general, many organizations prefer to isolate and store documents based on document type, requirements, and business needs. This calls for multiple repositories, so that each set of documents is stored in its own repository. File storage is organized as a folder structure on the NAS storage: each document type has a separate folder in which its documents are stored.


7.2 Metadata Arrangement

Metadata is data that provides information about the document, which is typically stored for

each document.


For example, when an employee’s profile is stored in a database, the metadata should contain information about the employee such as FirstName, LastName, DOB, the date the profile was uploaded, the user who stored it, etc.


In a NoSQL database, the metadata is stored in a schema-less format, which makes it efficient to store and retrieve. The metadata in our database is organized in JSON format, composed of key-value pairs as given below,











We do not create a separate JSON file for each document’s metadata. Instead, the metadata for all documents is stored in a single file.
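To make the single-file arrangement concrete, the Python sketch below appends each document's metadata as one JSON object per line and reads them all back. The one-object-per-line layout and the sample field values are assumptions for illustration; only the field names FirstName, LastName, and DOB come from the example above, and the actual file layout is not specified in this paper.

```python
import io
import json


def append_record(db_file, metadata):
    """Append one document's metadata to the shared file as a JSON line."""
    db_file.write(json.dumps(metadata, sort_keys=True) + "\n")


def load_records(db_file):
    """Read every metadata record back from the shared file."""
    db_file.seek(0)
    return [json.loads(line) for line in db_file if line.strip()]


# Usage with an in-memory file; a real DB file would live on the index path.
db_file = io.StringIO()
append_record(db_file, {"DocumentType": "EmployeeProfile", "FirstName": "Asha",
                        "LastName": "Rao", "DOB": "1990-01-01"})
append_record(db_file, {"DocumentType": "EmployeeProfile", "FirstName": "Ben",
                        "LastName": "Lee", "DOB": "1985-05-05"})
records = load_records(db_file)
```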


7.3 Index

A vital part of the DMS is retrieving stored documents based on the search values passed. For an effective search, the stored metadata has to be indexed so that the necessary documents are retrieved faster. In our design, we have categorized indexes based on the kind of search input values, namely:

1. Common Index

2. Reverse Index

3. Unique Index


All the above index information has to be captured in a configuration file during the repository

design and each repository will have a separate configuration file.


7.3.1 Common Index

The metadata field that yields the smallest set of results should be listed under the Common Index section. Consider an employee’s pay check searched by the metadata fields LastName and FirstName. Here LastName can be listed under the Common Index, on the assumption that a given LastName matches fewer documents than a given FirstName.


Whenever data is inserted into the database, it is also inserted into a tree-like structure. The first value creates the root node of the tree, and each subsequent value is placed by comparison with the nodes it meets: if it is less than the node, it goes to the left; if it is greater, it goes to the right. A tree is built this way for all inserted data and stored in an index file. The common index follows the concept of a B+ tree, in which all metadata values are found at the leaf nodes and every leaf is at the same depth. Searching for any record therefore takes roughly the same time.
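A full B+ tree is too long to show here, so the Python sketch below uses a sorted list with binary search as a simplified stand-in. It keeps the two properties that matter for the common index, ordered keys and logarithmic lookup, but it is not a B+ tree and is not the index implementation used by the DMS. The class and method names are hypothetical.

```python
import bisect


class SortedIndex:
    """Ordered index from a metadata value to the documents that carry it."""

    def __init__(self):
        self._keys = []     # metadata values, kept in sorted order
        self._doc_ids = []  # document IDs, aligned position by position

    def insert(self, key, doc_id):
        """Insert a value, placing it after any equal keys already present."""
        position = bisect.bisect_right(self._keys, key)
        self._keys.insert(position, key)
        self._doc_ids.insert(position, doc_id)

    def search(self, key):
        """Return the IDs of all documents whose indexed value equals key."""
        low = bisect.bisect_left(self._keys, key)
        high = bisect.bisect_right(self._keys, key)
        return self._doc_ids[low:high]
```

Because the keys stay sorted, the same structure could also answer range queries, which is one of the usual reasons for choosing a B+ tree.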


7.3.2 Reverse Index

All secondary search parameters should be listed under the Reverse Index section of the index configuration file. In the example above, FirstName can be listed under the Reverse Index section.
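The paper does not define the internal structure of the reverse index. The Python sketch below assumes it behaves like a conventional inverted index: a map from a (field, value) pair to the set of documents that carry it, used to narrow the candidates found through the primary search field. All names here are hypothetical.

```python
from collections import defaultdict


class ReverseIndex:
    """Map each (field, value) pair to the set of matching document IDs."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, field_name, value, doc_id):
        self._postings[(field_name, value)].add(doc_id)

    def lookup(self, field_name, value):
        """Return a copy of the matching document IDs (empty if none)."""
        return set(self._postings.get((field_name, value), set()))


def refine(candidates, index, field_name, value):
    """Keep only the candidates that also match the secondary field."""
    return sorted(set(candidates) & index.lookup(field_name, value))
```

In the pay check example, the documents found by LastName would be passed as `candidates` and filtered by FirstName.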


7.3.3 Unique Index

Any metadata field that stores a unique value should be listed under the Unique Index section. For example, if the same employee’s pay check needs to be searched by EmployeeID, each employee has a unique ID, so EmployeeID can be listed under the Unique Index section. Similarly, DocumentID can be listed under the Unique Index if it will be passed as a search parameter.
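A unique index can be pictured as a one-to-one map from a value to a single document, as in the Python sketch below. Rejecting a duplicate value with an error is an assumption made for this example; the paper does not say how the DMS handles a repeated unique value.

```python
class UniqueIndex:
    """Map each unique metadata value to exactly one document ID."""

    def __init__(self):
        self._by_value = {}

    def add(self, value, doc_id):
        """Register a value; a repeated value is treated as an error here."""
        if value in self._by_value:
            raise ValueError(f"Duplicate value in unique index: {value!r}")
        self._by_value[value] = doc_id

    def find(self, value):
        """Return the document ID for this value, or None if absent."""
        return self._by_value.get(value)
```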
