Technical White Paper
Author Prem Kumar Unnikrishnan
Date Of Creation 8/1/2017
Revision Date
Revision Number 1.0
File Name Document Management System using NOSQL
Table of Contents
1. INTRODUCTION
1.1 STORAGE REPOSITORIES
1.2 METADATA
1.3 VALIDATION
1.4 SECURITY AND ACCESS CONTROL
1.5 INDEXING AND SEARCH
1.6 VERSION CONTROL
2. DOCUMENT MANAGEMENT SYSTEM USER BENEFITS
2.1 EASY ACCESS
2.2 EASY TO SEARCH
2.3 BETTER COLLABORATION
2.4 ADDED SECURITY
2.5 SAVES SPACE
2.6 DISASTER RECOVERY
3. ARCHITECTURE
4. SERVICE
4.1 CREATING A SERVICE
4.2 METHODS EXPOSED
4.2.1 Check-In File
4.2.2 Retrieve files
4.2.3 Update files
4.3 RESPONSE DATA
5. SECURITY CONFIGURATIONS
6. CONFIGURATIONS
6.1 REPOSITORY CONFIGURATION
6.1.1 DocumentType
6.1.2 DocumentIdPrefix
6.1.3 Permissions
6.1.4 Repository Name
6.1.5 Class Name
7. DATABASE
7.1 FILE STORAGE
7.2 METADATA ARRANGEMENT
7.3 INDEXING SERVER
7.3.1 Unique Index
7.3.2 Common Index
7.3.3 Reverse Index
1. Introduction
A Document Management System (DMS) is a software system used to store and manage electronic
documents and images of paper-based information. It is a challenge for companies to keep up with all the paperwork and electronic files that come into an office or business every day. It generally starts slowly: an email here, a receipt there, incoming invoices and customer letters. Before you know it, you have a mountain of paper and no way to find the documents you need. This is where a Document Management System becomes important. It organizes all digital files generated throughout an organization in a central location using a defined process.
A DMS mainly incorporates:
Storage repositories
Metadata
Validation
Security and access control
Indexing and Search
Check-in systems
Information retrieval systems
Version control
1.1 Storage Repositories
An organization may hold a wide variety of documents, such as receipt notices,
invoices, letters, approvals, requests, information and emails. It is generally better to
have a separate repository for each kind of document, so that each can be maintained and
accessed effectively.
1.2 Metadata
Metadata is key in a document management system. Storing a few items of related data
along with each document in a repository helps end users identify the document easily,
because a document-related keyword is much easier to remember than the actual
document name.
1.3 Validation
When storing various types of documents, a repository may need to reject documents that
arrive without certain metadata values. Some mandatory key values may also be required
to search for a document in a repository. In such cases, the DMS process should reject the
request or inform the calling system that it must pass the appropriate metadata values to
store or retrieve documents in that repository.
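The validation rule described above can be sketched as follows. This is a minimal illustration in Python, not the service's actual implementation; the document types, the key names and the MANDATORY_KEYS table are all hypothetical.

```python
# Hypothetical per-repository rules: metadata keys that must be present.
MANDATORY_KEYS = {
    "Invoice": {"DocumentType", "InvoiceNumber", "CustomerName"},
    "Letter": {"DocumentType", "Sender"},
}


def missing_metadata(metadata: dict) -> list:
    """Return the mandatory keys that are absent or empty, sorted by name."""
    doc_type = metadata.get("DocumentType")
    if doc_type not in MANDATORY_KEYS:
        # Without a known DocumentType no repository can be chosen.
        return ["DocumentType"]
    required = MANDATORY_KEYS[doc_type]
    return sorted(key for key in required if not metadata.get(key))


# An invoice submitted without a customer name is reported back to the caller.
print(missing_metadata({"DocumentType": "Invoice", "InvoiceNumber": "INV-7"}))
# ['CustomerName']
```

Returning the list of missing keys, rather than a plain yes or no, lets the service tell the calling system exactly which values it has to supply.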
1.4 Security and access control
Documents are checked in and retrieved by many systems. Each type of document may be
accessible to a different set of users, and the organization may want to restrict each
document repository so that only the related users can access it. The DMS process therefore
has to be designed with security in mind and allow each group of users to access only its own
business data.
1.5 Indexing and Search
Since an organization can have millions of documents stored in a repository, retrieving
documents from storage is not an easy task. Through search engines, document
management systems allow quick access to any document or file. This is achieved by
indexing the metadata stored along with the documents. Some metadata have a distinct
value for each document, while others have values shared by many documents.
Based on the need, different types of indexes should be built to produce the most effective
search results.
1.6 Version control
Versioning is a process by which documents are checked in to or out of the document
management system, allowing users to retrieve previous versions and to continue work
from a selected point. Versioning is useful for documents that change over time and
require updating, where it may be necessary to go back to or reference a previous copy.
2. Document Management System User Benefits
2.1 Easy access
Having such software in place means users do not need to stockpile loads of files on their
desks. Instead, they can use search to connect to a common repository and find the
documents they need.
2.2 Easy to search
Documents can be found by searching on a keyword.
2.3 Better collaboration
Getting rid of hard copies of documents makes it easy for employees to work with each
other.
2.4 Added security
Increased security is a huge advantage of using document management solutions.
2.5 Saves space
Exchanging paper documents for digital versions can save a tremendous amount of physical
space.
2.6 Disaster recovery
Businesses that have all of their documents stored in physical filing cabinets face the risk
that these papers might be destroyed, or that the business may lose access to them should
a disaster occur. A common storage repository with regular data backups helps an
organization safeguard its data more effectively.
3. Architecture
3.1 External Systems
An organization has multiple systems that generate documents or scan images as part of
their day-to-day business activities. These systems need an easy and secure way to store and access
these documents as needed.
3.2 Security Layer
Requests from multiple systems need to be authenticated in order to ensure secure
communication between the client and the database. This layer secures the service endpoint
by adding security controls that authenticate the clients and pass the requests and responses
between the client and the DMS service.
3.3 Load Balancing
Load balancers ensure reliability and availability by monitoring the "health" of applications and
only sending requests to servers and applications that can respond in a timely manner.
3.4 Service Layer
A Windows Communication Foundation (WCF) service is used as the medium for communicating
with the centralized repository. Since an organization may use different technologies,
each system may need to store or access documents in the centralized repository, and the
DMS therefore has to provide a common access protocol. A WCF service provides endpoints
that client applications use to communicate with the service.
3.5 Data Layer
In our design, we have chosen to store the documents and their metadata in a schema-less
database. It is designed to be flexible enough to handle unstructured data and to
scale to very large data volumes. The data in the database is held on a Network
Attached Storage (NAS) device, which allows storage and retrieval of data from a centralized
location for authorized network users.
4. Implementation
4.1 DMS Service
The fundamental purpose of this web service is to control how documents are exposed
to external applications and how client applications can interact with that functionality.
Interoperability is an important part of this, as is the ability to safely send requests through
firewalls. In its simplest form, a web service receives SOAP messages over HTTP that target
specific operations exposed at the service boundary.
As shown in the above diagram, the documents checked in by different
systems via the service are stored separately as documents and metadata. The service splits
the document from the incoming request and sends it to its respective file storage
path. Next, the corresponding metadata of the document is captured by the service and
sent to the DB file in the index path.
The DMS service reads the metadata values from the request and, based on the document type
value, finds the corresponding storage path in the configuration so that it can store the document in
the file share path. This document storage path is recorded in the metadata file to identify the
document during search.
Once the request is received, the service converts the metadata information into JSON format
and sends it to the DB file in the index path, which is
configured for each document type in the configuration file. This completes the storage of
the document in its respective location.
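The check-in flow described above (split the document from its metadata, store the file under the path for its document type, then append the metadata as JSON to the DB file in the index path) can be sketched as below. This is a simplified Python illustration, not the WCF service itself; the folder layout, the metadata.jsonl file name and the StoragePath key are assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def check_in(root: Path, file_name: str, data: bytes, metadata: dict) -> dict:
    """Store the document under its type's folder and append its metadata as JSON."""
    doc_type = metadata["DocumentType"]
    # Hypothetical layout: <root>/<type>/files for documents, <root>/<type>/index for metadata.
    storage_dir = root / doc_type / "files"
    index_dir = root / doc_type / "index"
    storage_dir.mkdir(parents=True, exist_ok=True)
    index_dir.mkdir(parents=True, exist_ok=True)

    stored_path = storage_dir / file_name
    stored_path.write_bytes(data)

    # The storage path is recorded with the metadata so a search can locate the file.
    record = dict(metadata, StoragePath=str(stored_path))
    with open(index_dir / "metadata.jsonl", "a", encoding="utf-8") as db_file:
        db_file.write(json.dumps(record) + "\n")
    return record


root = Path(tempfile.mkdtemp())
record = check_in(root, "inv-7.pdf", b"%PDF-", {"DocumentType": "Invoice", "Customer": "Acme"})
```

In the design described here, the storage and index paths would come from the repository configuration file rather than being derived from the document type name.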
To secure the web service, we chose Windows authentication over SSL, which is
handled through data power.
4.2 Methods exposed
The basic operations performed in the DMS are exposed as methods that are consumed by
other applications. As mentioned in the previous sections, the DMS is all about uploading documents
along with their respective metadata, so each method should allow the consumer to send or receive
the file path and metadata.
Metadata
Metadata is a combination of a key and its value. Since there can be multiple metadata
values, the methods are designed to accept metadata as a list of key-value pairs.
In addition, documents may be segregated by department or business, which is driven
by an additional parameter in the method. In our design it is called
ApplicationID.
The list of metadata always contains a mandatory parameter that identifies the correct
repository under the corresponding ApplicationID; here we call it DocumentType.
It is a mandatory parameter for all the methods described below.
Configurations
Repository Configuration:
This configuration file is the bridge between the web service request and the DMS repository.
The properties below are defined in the config file to identify the repositories in which to store or access the data.
Different repositories are created based on the type of document, and each is identified by
the primary metadata value, named DocumentType. Some further information is needed
to access the right repository for each type of document.
When consumers call the methods, the initial call is directed to the repository
configuration file to identify the right values for processing the request (check-in, retrieve or update).
4.2.1 DocumentType
DocumentType in the repository configuration is the key parameter used to look up the
configuration for the document being uploaded or searched.
4.2.2 DocumentIdPrefix
Since there are multiple repositories, plain sequential numbers alone would make it
difficult to tell apart the unique IDs of different repositories. A prefix is therefore prepended to the unique number.
For example, the prefix DMSABC generates DocumentIds such as DMSABC-01, DMSABC-02, etc.
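A minimal sketch of prefixed ID generation, in Python for illustration. The DocumentIdGenerator class and the in-memory counter are assumptions; a real service would need to persist the last number issued for each repository.

```python
from collections import defaultdict


class DocumentIdGenerator:
    """Issue IDs of the form <prefix>-<number>, numbered separately per prefix."""

    def __init__(self) -> None:
        self._last_number = defaultdict(int)

    def next_id(self, prefix: str) -> str:
        self._last_number[prefix] += 1
        # Two-digit zero padding mirrors the DMSABC-01 example in the text.
        return f"{prefix}-{self._last_number[prefix]:02d}"


generator = DocumentIdGenerator()
ids = [generator.next_id("DMSABC"), generator.next_id("DMSABC"), generator.next_id("DMSXYZ")]
print(ids)  # ['DMSABC-01', 'DMSABC-02', 'DMSXYZ-01']
```

Because each prefix keeps its own counter, two repositories can both issue number 01 without their DocumentIds colliding.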
4.2.3 Permissions
As mentioned in the security configuration section, read and write access for each
repository can be configured here, so that during the actual execution of the methods this
information can be retrieved from the repository configuration file and used to allow or
restrict users.
4.2.4 Repository Name
Name of the repository where documents of the corresponding type are stored or
searched.
4.2.5 Class Name
From the implementation perspective, each type of document might have a different set
of metadata, validations, search parameters and response values, depending on the business
need. In such cases, different classes are created to separate functionality such as
response metadata, validation and search queries. Since each type of document has its own
behaviour in the implementation, the class name must be set in the
configuration file so that it can be reached easily from the operations.
When a request is made to access or store documents, the ApplicationID in the request body
determines which repository the request is directed to.
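The role of the configured class name can be illustrated with a small Python sketch. The handler classes, the configuration values and the handler_for helper are hypothetical; they only show how a per-type class could be looked up from a repository configuration entry holding the five properties described above.

```python
class DocumentHandler:
    """Base behaviour; subclasses name the metadata keys their type requires."""
    required_keys = ()

    def validate(self, metadata: dict) -> bool:
        return all(metadata.get(key) for key in self.required_keys)


class InvoiceHandler(DocumentHandler):
    required_keys = ("InvoiceNumber",)


class LetterHandler(DocumentHandler):
    required_keys = ("Sender",)


# Hypothetical repository configuration, keyed by DocumentType.
REPOSITORY_CONFIG = {
    "Invoice": {
        "DocumentIdPrefix": "DMSINV",
        "Permissions": {"read": ["audit-app"], "write": ["billing-app"]},
        "RepositoryName": "InvoiceRepository",
        "ClassName": "InvoiceHandler",
    },
    "Letter": {
        "DocumentIdPrefix": "DMSLET",
        "Permissions": {"read": ["hr-app"], "write": ["hr-app"]},
        "RepositoryName": "LetterRepository",
        "ClassName": "LetterHandler",
    },
}

# Only classes listed here can be named in the configuration.
HANDLER_CLASSES = {cls.__name__: cls for cls in (InvoiceHandler, LetterHandler)}


def handler_for(document_type: str) -> DocumentHandler:
    """Instantiate the class named in the configuration for this document type."""
    class_name = REPOSITORY_CONFIG[document_type]["ClassName"]
    return HANDLER_CLASSES[class_name]()
```

Looking the class up in an explicit registry, instead of loading an arbitrary name, keeps a typo in the configuration file from resolving to an unintended class.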
4.2.6 Check-In File
The first operation in the DMS is checking a document into a repository. This
needs a service method that uploads the document along with its metadata. We first
look at uploading a document from a shared path. Assume the method is
designed as below:
CheckInFile(string applicationId, string sourceFilePath, List<KeyValuePair> metaData, CheckInOptions checkInOptions)
Where,
applicationId = Value chosen based on the business need.
sourceFilePath = Path from which the document is to be uploaded.
metaData = Metadata values for the document.
checkInOptions = None or 1.
None does not delete the uploaded document from the shared path after the file upload.
1 deletes the file from the shared path after the file upload.
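The effect of the two check-in options can be sketched as follows, in Python for illustration. The enum member names, the numeric value given to NONE and the check_in_file helper are assumptions; only the behaviour (keep or delete the source file after upload) comes from the description above.

```python
import shutil
import tempfile
from enum import Enum
from pathlib import Path


class CheckInOptions(Enum):
    NONE = 0           # keep the source file on the shared path (numeric value assumed)
    DELETE_SOURCE = 1  # remove the source file after the upload


def check_in_file(source: Path, repository_dir: Path, options: CheckInOptions) -> Path:
    """Copy the source file into the repository folder, then apply the option."""
    repository_dir.mkdir(parents=True, exist_ok=True)
    target = repository_dir / source.name
    shutil.copy2(source, target)
    if options is CheckInOptions.DELETE_SOURCE:
        source.unlink()
    return target


work = Path(tempfile.mkdtemp())
shared = work / "shared"
shared.mkdir()
source = shared / "letter.txt"
source.write_text("hello")
stored = check_in_file(source, work / "repo", CheckInOptions.DELETE_SOURCE)
```

The source file is deleted only after the copy has completed, so a failed upload does not lose the original.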
Another option is provided to upload the document as a byte array along with the
metadata.
CheckInFileUsingPayload(string applicationId, string fileName, byte[] fileData, List<KeyValuePair> metaData)
Where,
applicationId = Value chosen based on the business need.
fileName = Name of the document to be uploaded.
fileData = Array of bytes extracted from the document to be uploaded.
metaData = Metadata values for the document.
If a string value needs to be uploaded as a file into the DMS repository along
with metadata, the method below can be used. It is mostly useful for storing
messages.
CheckInFileUsingText(string applicationId, string fileName, string fileData, List<KeyValuePair> metaData)
Where,
applicationId = Value chosen based on the business need.
fileName = Name of the document to be uploaded.
fileData = String value to be stored as the file content.
metaData = Metadata values for the document.
4.2.7 Retrieve files
When documents are stored in the document repository, they can also be retrieved
globally, depending on how the system is set up and which users are granted access.
Retrieval is split into multiple methods, as shown below.
The method below returns one or more results based on the search parameters
passed; it accepts a list of metadata values to search for the relevant data.
GetFilesByMetaData(string applicationId, List<KeyValuePair> metaData)
Where,
applicationId = Value chosen based on the business need.
metaData = Combination of search metadata values, as needed.
In some scenarios, an organization expects only the topmost record matching
the combination of search parameters, or needs to find a particular document by
its unique values. In such cases the method below can be used; it returns only one
result.
GetFirstFileByMetaData(string applicationId, List<KeyValuePair> metaData)
Where,
applicationId = Value chosen based on the business need.
metaData = Combination of search metadata values, as needed.
To get the count of documents in a particular repository, the method below can be used. It
also accepts a combination of metadata, so the number of documents in the repository
matching certain values can be determined.
For example, it can return the number of documents for employees born on a particular date.
GetFileCountByMetaData(string applicationId, List<KeyValuePair> metaData)
Where,
applicationId = Value chosen based on the business need.
metaData = Combination of search metadata values, as needed.
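The relationship between the three retrieval methods can be sketched in Python as below. The sample records are invented, the applicationId parameter is omitted for brevity, and a linear scan stands in for the index lookup that the service would use.

```python
# Hypothetical metadata records for one repository.
DOCUMENTS = [
    {"DocumentId": "EMP-01", "LastName": "Rao", "FirstName": "Anil", "DOB": "1990-04-12"},
    {"DocumentId": "EMP-02", "LastName": "Rao", "FirstName": "Mina", "DOB": "1988-01-30"},
    {"DocumentId": "EMP-03", "LastName": "Shah", "FirstName": "Anil", "DOB": "1990-04-12"},
]


def get_files_by_metadata(criteria: dict) -> list:
    """Return every record whose metadata matches all the given key-value pairs."""
    return [doc for doc in DOCUMENTS
            if all(doc.get(key) == value for key, value in criteria.items())]


def get_first_file_by_metadata(criteria: dict):
    """Return the first matching record, or None when nothing matches."""
    matches = get_files_by_metadata(criteria)
    return matches[0] if matches else None


def get_file_count_by_metadata(criteria: dict) -> int:
    """Return how many records match the given key-value pairs."""
    return len(get_files_by_metadata(criteria))


# Number of employee documents for people born on a particular date.
print(get_file_count_by_metadata({"DOB": "1990-04-12"}))  # 2
```

All three methods share the same matching rule: a record qualifies only if every supplied key-value pair matches.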
4.2.8 Update files
In certain cases, consumers might need to update uploaded documents or their
metadata. For such scenarios, the DMS service provides an option to overwrite the
files and to update the metadata.
4.2.8.1 Document update:
Documents can be updated using any of the check-in methods: update
the document locally and send it to the service with the same document name. Before
checking in, the DMS service checks whether a document with the same name already exists in the
repository. If it does, the service overwrites the existing document; otherwise it
uploads the document as a new one.
Metadata is also overwritten along with the document update, so it is important to
send the same metadata with the document if it has to be retained.
Otherwise the new metadata values or null values passed to the method will be stored.
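The overwrite behaviour described above can be sketched as follows, in Python for illustration. The in-memory repository dictionary and the return values "created" and "updated" are assumptions made for the example.

```python
repository = {}  # hypothetical in-memory repository: document name -> record


def check_in(name: str, content: bytes, metadata: dict) -> str:
    """Overwrite the document if the name already exists, otherwise add it."""
    action = "updated" if name in repository else "created"
    # Metadata is replaced wholesale, so values that are not re-sent are lost.
    repository[name] = {"content": content, "metadata": dict(metadata)}
    return action


first = check_in("policy.txt", b"v1", {"Owner": "HR", "Year": "2017"})
second = check_in("policy.txt", b"v2", {"Owner": "HR"})
print(first, second)  # created updated
```

After the second call the Year value is gone, which is why the same metadata must be sent again when it has to be kept.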
4.2.8.2 Metadata update:
If only the metadata needs to be updated, consumers can use the method below. The
update is keyed on the unique metadata value called DocumentID, which must be sent
with the update call to locate the document to
update.
UpdateFileMetaData(string applicationId, string DocumentId, List<KeyValuePair> metaData)
Where,
applicationId = Value chosen based on the business need.
DocumentId = Unique ID of the document to be updated.
metaData = List of metadata values to be updated.
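A minimal Python sketch of a metadata-only update keyed on DocumentId. The text does not say whether keys that are not sent are kept or removed; this sketch assumes they are kept (a merge), and the metadata_store dictionary is hypothetical.

```python
# Hypothetical metadata store: DocumentId -> metadata.
metadata_store = {"DMSABC-01": {"DocumentType": "Invoice", "Status": "Open"}}


def update_file_metadata(document_id: str, changes: dict) -> dict:
    """Apply the changes to the document identified by its unique DocumentId."""
    if document_id not in metadata_store:
        raise KeyError(f"Unknown DocumentId: {document_id}")
    # Assumed merge semantics: keys not mentioned in `changes` are left as they are.
    metadata_store[document_id].update(changes)
    return metadata_store[document_id]


updated = update_file_metadata("DMSABC-01", {"Status": "Paid"})
```

Rejecting an unknown DocumentId, rather than silently creating a record, keeps the update method from being used as an unintended check-in.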
4.3 Response data
Whenever a document is checked in or a retrieval is requested, the response XML has to be
sent back in a fixed format. Consumers expect a known return data type from which to read and
process the data. We have defined the response data in a class whose members are used by
consumers to process the results as needed.
In our design, response data is exposed in the ResponseData class, whose members are:
DocumentName: Name of the document checked in, or the document name from the result.
DocumentId: Unique ID of the document in the repository.
DocumentUrl: URL from which the document can be downloaded.
Metadata: List of metadata passed during the document upload.
Once the service is configured in the consumer application, consumers can
access the response data using the above class.
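The shape of the response class can be sketched in Python as below. The four members follow the list above; the snake_case attribute names, the sample values and the example URL are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class ResponseData:
    document_name: str   # DocumentName
    document_id: str     # DocumentId
    document_url: str    # DocumentUrl
    metadata: list = field(default_factory=list)  # list of (key, value) pairs


response = ResponseData(
    document_name="inv-7.pdf",
    document_id="DMSABC-01",
    document_url="https://dms.example/files/DMSABC-01",  # hypothetical URL
    metadata=[("DocumentType", "Invoice"), ("Customer", "Acme")],
)

# A consumer can turn the key-value list into a dictionary for lookups.
metadata_by_key = dict(response.metadata)
```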
5. Configuration
Application Configuration
Repository Configuration
5.1 Repository Configuration
Different repositories are created based on the type of document, and each is identified by
the primary metadata value, named DocumentType. Some further information is needed
to access the right repository for each type of document.
When consumers call the methods, the initial call is directed to the repository
configuration file to identify the right values for processing the request (check-in, retrieve or update).
6. Security Configurations
A DMS can have multiple repositories, and each repository may be accessed by different
external systems. Security is vital for each repository during check-in and retrieval of documents.
Authorization is performed in each method of the service, so that access is verified before the
repository is touched and the operation is allowed. Since only two general operations need to be
authorized, retrieval and check-in (which includes the update methods), the service has to verify
access only for these. During a check-in or update, write access
permission is verified; similarly, only users configured with read access
are permitted to perform retrieval.
In our design, read and write user access is configured in the repository configuration file,
where each repository-level configuration is made.
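The read and write checks described above can be sketched in Python as follows. The repository name, the user names and the PERMISSIONS table are hypothetical stand-ins for what would be read from the repository configuration file.

```python
# Hypothetical permissions, as they might be read from the repository configuration.
PERMISSIONS = {
    "InvoiceRepository": {"read": {"billing-app", "audit-app"}, "write": {"billing-app"}},
}

# Check-in and update need write access; retrieval needs read access.
REQUIRED_ACCESS = {"check-in": "write", "update": "write", "retrieve": "read"}


def is_authorized(user: str, repository: str, operation: str) -> bool:
    """Return True when the user holds the access level the operation requires."""
    access = REQUIRED_ACCESS[operation]
    allowed = PERMISSIONS.get(repository, {}).get(access, set())
    return user in allowed


print(is_authorized("audit-app", "InvoiceRepository", "retrieve"))  # True
print(is_authorized("audit-app", "InvoiceRepository", "update"))    # False
```

An unknown repository yields an empty set of allowed users, so access is denied by default.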
7. Database
Storage is a vital part of the DMS design; it is where the documents are stored. In
our design the storage is divided into two parts:
Storage of documents
Metadata information required to retrieve the documents from the repository.
7.1 File storage
In general, many organizations prefer to isolate and store documents based on
document type, requirements and business needs. In such cases multiple repositories are
required, each storing its own set of documents. File storage is
designed as a folder structure on the NAS storage, i.e., each document type has a separate
folder for its documents.
7.2 Metadata Arrangement
Metadata is data that provides information about the document and is typically stored for
each document.
For example, when storing an employee’s profile in a database, the metadata should contain
information about the employee such as FirstName, LastName, DOB, the date the profile was
uploaded, the user who stored it, etc.
In a NoSQL DB, the metadata is stored in an unstructured format, which is efficient to store and
retrieve. The metadata in our database is organized in JSON format, composed
of key and value pairs.
We do not create a separate JSON file for each document’s metadata; the metadata records
are kept together in a single file.
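As an illustration of metadata held as JSON key-value pairs in one shared file, the Python sketch below serializes two employee-profile records, one JSON object per line. The field values, the extra Department key and the one-object-per-line layout are assumptions; the text only states that the metadata is JSON and is not split into one file per document.

```python
import json

# Hypothetical employee-profile metadata; the second record carries an extra key.
profiles = [
    {"DocumentId": "EMP-01", "FirstName": "Anil", "LastName": "Rao", "DOB": "1990-04-12",
     "UploadedOn": "2017-08-01", "UploadedBy": "hr-portal"},
    {"DocumentId": "EMP-02", "FirstName": "Mina", "LastName": "Rao", "DOB": "1988-01-30",
     "UploadedOn": "2017-08-01", "UploadedBy": "hr-portal", "Department": "Finance"},
]

# All records share one file's worth of text, and records need not have the same keys.
serialized = "\n".join(json.dumps(profile) for profile in profiles)
restored = [json.loads(line) for line in serialized.splitlines()]
```

Because no schema is enforced, a new metadata key can be added to later records without touching the ones already stored.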
7.3 Index
A vital part of the DMS is retrieving stored documents based on the search values passed.
For an effective search, the stored metadata has to be indexed so that retrieval of the
required documents is faster. In our design, indexes are categorized by the kind of
search input value:
1. Common Index
2. Reverse Index
3. Unique Index
All of the above index information has to be captured in a configuration file during repository
design, and each repository has a separate configuration file.
7.3.1 Common Index
Metadata that yields the smallest set of results should be listed under the Common Index
section. Consider an employee’s pay check searched by the metadata LastName and FirstName.
Here LastName can be listed under Common Index, since it is expected to return fewer results than FirstName.
Whenever data is inserted into the database, a tree-like structure is built to hold
it. The first item creates a node in the tree, and each subsequent item is placed relative
to its parent node: if it is less than the parent node it is inserted to the
left, and if it is greater it is inserted to the right of the
parent node. A tree is built this way for all inserted data and stored in an
index file. When a metadata value is searched, it is found at a leaf node of the tree,
so searching for any record takes about the same time because the leaf nodes are at equal distance from the root.
The common index thus follows the concept of a B+ tree structure.
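The ordered lookup that such an index provides can be illustrated with a much simpler stand-in. The Python sketch below is not a B+ tree: it keeps the entries in a sorted list and uses binary search, which gives the same ordered lookup behaviour in memory. The SortedIndex class and the sample values are hypothetical.

```python
import bisect


class SortedIndex:
    """Keeps (value, document_id) pairs sorted so lookups use binary search."""

    def __init__(self) -> None:
        self._entries = []

    def insert(self, value: str, document_id: str) -> None:
        # insort places the new pair at its sorted position.
        bisect.insort(self._entries, (value, document_id))

    def find(self, value: str) -> list:
        """Return the IDs of all documents indexed under `value`."""
        # Every entry for `value` sits in one contiguous run of the sorted list.
        start = bisect.bisect_left(self._entries, (value,))
        ids = []
        for entry_value, document_id in self._entries[start:]:
            if entry_value != value:
                break
            ids.append(document_id)
        return ids


last_name_index = SortedIndex()
last_name_index.insert("Rao", "EMP-02")
last_name_index.insert("Shah", "EMP-03")
last_name_index.insert("Rao", "EMP-01")
print(last_name_index.find("Rao"))  # ['EMP-01', 'EMP-02']
```

A real B+ tree keeps the same sorted order but spreads it over fixed-size nodes on disk, so an insert does not have to shift a whole list.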
7.3.2 Reverse Index
All secondary search parameters should be listed under the reverse index section of the index
configuration file. In the above example, FirstName can be listed under the reverse index
section.
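The text does not define how the reverse index is built. The Python sketch below assumes it behaves like an inverted index, mapping each metadata value to the set of DocumentIds that carry it, and shows how it could narrow a candidate set already produced by the primary search. The ReverseIndex class and the sample data are hypothetical.

```python
from collections import defaultdict


class ReverseIndex:
    """Assumed inverted index: metadata value -> set of DocumentIds."""

    def __init__(self) -> None:
        self._ids_by_value = defaultdict(set)

    def insert(self, value: str, document_id: str) -> None:
        self._ids_by_value[value].add(document_id)

    def narrow(self, candidates: set, value: str) -> set:
        """Keep only the candidate documents that also carry `value`."""
        return candidates & self._ids_by_value.get(value, set())


first_name_index = ReverseIndex()
first_name_index.insert("Anil", "EMP-01")
first_name_index.insert("Mina", "EMP-02")
first_name_index.insert("Anil", "EMP-03")

# Candidates found by LastName = "Rao", then narrowed by FirstName = "Anil".
print(first_name_index.narrow({"EMP-01", "EMP-02"}, "Anil"))  # {'EMP-01'}
```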
7.3.3 Unique Index
Any metadata that stores a unique value should be listed under the Unique Index section. For
example, if the same employee’s pay check needs to be searched by EmployeeID, each
employee has a unique ID, so EmployeeID can be listed under the Unique Index section. Similarly,
DocumentID can be listed under Unique Index if it will be passed as a search parameter.
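A unique index can be sketched in Python as a plain mapping from value to a single document. The UniqueIndex class is hypothetical, and rejecting a duplicate value is an assumption about how uniqueness would be enforced.

```python
class UniqueIndex:
    """Maps each unique metadata value to exactly one DocumentId."""

    def __init__(self) -> None:
        self._id_by_value = {}

    def insert(self, value: str, document_id: str) -> None:
        if value in self._id_by_value:
            # Assumed behaviour: a second document with the same value is rejected.
            raise ValueError(f"Duplicate unique value: {value}")
        self._id_by_value[value] = document_id

    def find(self, value: str):
        """Return the DocumentId for `value`, or None when it is not indexed."""
        return self._id_by_value.get(value)


employee_id_index = UniqueIndex()
employee_id_index.insert("E1001", "PAY-01")
employee_id_index.insert("E1002", "PAY-02")
print(employee_id_index.find("E1002"))  # PAY-02
```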