
14 November 2016

RDBMS to noSQL migration: 10 essential things you cannot miss

Posted by: Bhaskar Kulkarni


Introduction


Companion - a person or animal with whom one spends a lot of time or with whom one travels.

This is not a typical user guide but a collection of experiences from multiple migration projects, particularly relational-to-NoSQL migrations. This companion will help you if you are planning to migrate any data source in general; as a typical use case, it walks through an RDBMS to MongoDB migration.

General migration guidelines


Every data migration process is unique, but they all share a few characteristics, which means every data migration must follow certain rules to be a success. We're not going to talk about the overall planning aspect, as that is too general; every project needs planning. What we are going to cover is guidance specific to data migration.

            Know your data


                The most important thing is to know your data before you plan and decide on the implementation path. If you plan an implementation strategy without knowing your data, it may completely scrap or halt the overall migration process. Answers to the following questions give more insight into knowing your data well.

A. How big is your data? – Knowing this answers questions like what technology to choose, how to architect the migration process so it can scale, and what hardware is required.

B. Is all the data required on the target? – There is always a class of data that is temporal: work directories, work tables, logs, and data or schemas that were added due to technical constraints.

C. Data types – Know all the datatypes used on the source side. This helps in understanding the available options on the target side and whether every source datatype has a replacement on the target.
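As a sketch of how you might answer A and C up front, the snippet below profiles a source database for table row counts and column types. It uses Python's built-in SQLite driver as a stand-in for the real source; the table and column names are purely illustrative.

```python
import sqlite3

def profile_source(conn):
    """Collect table names, row counts, and declared column types."""
    profile = {}
    # Table names come from the catalog, so interpolating them is safe here.
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        cols = {r[1]: r[2] for r in conn.execute(f"PRAGMA table_info({table})")}
        profile[table] = {"rows": rows, "columns": cols}
    return profile

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER, name TEXT, price REAL)")
conn.execute("INSERT INTO product VALUES (1, 'pen', 2.5)")
print(profile_source(conn))
```

Running this against every source schema gives you the inventory of datatypes that need a mapping on the target side.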

            Location-specific data


                Location-specific data is data that represents a specific location, say a file path, a URL, an IP address, or in some cases even domain users. Handling it requires a deep dive into the application the data belongs to, but at the same time it gives you fine-grained control over the migration process. Knowing your location-specific data helps you define your data transformation strategy.
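A minimal sketch of such a transformation, assuming a small, known value set: the mapping table and the paths/addresses in it are hypothetical, standing in for whatever your application actually stores.

```python
# Hypothetical mapping from source-environment locations to target equivalents.
LOCATION_MAP = {
    "10.0.0.5": "192.168.1.5",      # app server moved to a new subnet
    "/mnt/images": "/data/images",  # image share remounted on the target
}

def transform_location(value, default=""):
    """Rewrite a known location prefix; fall back to a default for unknown ones."""
    for old, new in LOCATION_MAP.items():
        if value.startswith(old):
            return value.replace(old, new, 1)
    return default

print(transform_location("/mnt/images/logo.png"))  # -> /data/images/logo.png
```

When the value set is large or unbounded, the default-value branch (or an empty value) is the pragmatic choice the text describes.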

Embedded data


A lot of applications use a single column in an RDBMS table to store a bunch of information together, typically in either XML or JSON format. It is certainly not a best practice in the RDBMS world, but in some cases it makes life easier for application developers, or makes the application perform a bit better. Knowing about such embedded data in advance makes it easier to architect the data transformation process during migration.
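For a document store, that embedded column is actually good news: it can be expanded into a proper subdocument during transformation. A sketch, with a hypothetical `attributes` column holding JSON text:

```python
import json

def expand_embedded(row):
    """Turn a row whose 'attributes' column holds JSON text into a nested document."""
    doc = dict(row)
    doc["attributes"] = json.loads(doc.get("attributes") or "{}")
    return doc

row = {"id": 7, "name": "pen", "attributes": '{"color": "blue", "ink": "gel"}'}
doc = expand_embedded(row)
print(doc["attributes"]["color"])  # -> blue
```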

Encoding


You don't always have the luxury of having your data in a perfect format. Having your data in Unicode makes migration much easier, but if it isn't, additional steps need to be taken to find the easiest possible migration path. Use the source database's capabilities to analyze the source character set and figure out the best solution. For example, Oracle provides a utility to scan the database character set (CSSCAN), which creates a report that can be used to plan the character set migration.
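Outside the database, a quick way to flag problem values during extraction is to check whether the raw bytes already decode cleanly as UTF-8. A minimal sketch:

```python
def is_clean_utf8(raw: bytes) -> bool:
    """True if the raw bytes are already valid UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A Latin-1 encoded e-acute is a single byte and is not valid UTF-8 on its own:
print(is_clean_utf8("café".encode("utf-8")))    # True
print(is_clean_utf8("café".encode("latin-1")))  # False
```

Values that fail the check need an explicit transcoding step (with the source character set confirmed by a tool like CSSCAN) before they are loaded into the target.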

            Define transformation strategy


                Though birds have mastered it (they have been doing it since the dawn of life on the planet), migration has never been and never will be an easy ride. A lot of considerations have to go in, and one of them is transformation to fit the target environment.

On the source side, knowing the answers to questions like "do I have location-specific data?" or "do I have embedded data?" helps define the data transformation strategy and how deep you have to go. Let's say you store the IP addresses of all the machines in your network, or the paths of image files, which may or may not be accessible in the target environment. You can either substitute default or empty values, or maintain a data mapping for each value if the value set is small.

Datatype transformation – Most data sources, be it RDBMS or noSQL, have similar kinds of data types, but in some cases you must transform them because the target environment doesn't have the exact data type or precision.
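A sketch of such an adapter, assuming a target (like MongoDB's classic BSON types) with no arbitrary-precision decimal and no bare DATE type; the chosen mappings are illustrative policy decisions, not the only valid ones:

```python
from decimal import Decimal
from datetime import date, datetime

def adapt_value(value):
    """Map source values onto types the target can store natively."""
    if isinstance(value, Decimal):
        return str(value)  # keep exact precision by storing as a string
    if isinstance(value, date) and not isinstance(value, datetime):
        # Promote a bare DATE to a midnight datetime the target understands.
        return datetime(value.year, value.month, value.day)
    return value

print(adapt_value(Decimal("19.99")))    # '19.99'
print(adapt_value(date(2016, 11, 14)))  # 2016-11-14 00:00:00
```

Whether a decimal becomes a string, a float, or a scaled integer is exactly the kind of precision trade-off this step has to settle up front.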

            Migration validation plan


                Having passed through all the hurdles, you and your convoy reach the destination. Now the big questions are: Have you arrived at the right place? Have you arrived safely? Are some of your party missing? Is everyone safe and sound? Is there any damage to the convoy, and if so, what should be done to mitigate it?

Reaching the destination is only half the battle won; what comes afterwards is the tedious process of validation. With a considerable amount of data to validate, having some sort of automation in place always makes your life easier.

Start from very basic validations and then drill into specifics.

Basic validations are not very precise indicators of a valid data migration, but they take you to a comfort zone from which you can start some serious validation.

Some validations to consider:

1. Size validation – The source data marked for migration and the target size should be close to each other.

2. Sample data – Pick some sample data from the migrated system and check the data type conversions and location-specific data conversions.

3. If you have a sample application ready, connect it to the migrated data and see whether it behaves correctly. Start with read-only functionality, where the application will not update or add any data. This ensures you are validating the migrated data without the complication of having both kinds of data: migrated data and data added or modified by the new target application.

4. Once that test passes, try adding or modifying data from the new application on the target environment; this is a real confidence booster.
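The first of these checks is easy to automate. A minimal sketch comparing per-entity counts between source and target, with an optional tolerance for entities that are still receiving writes; the entity names and counts are made up:

```python
def validate_counts(source_counts, target_counts, tolerance=0.0):
    """Return (entity, source, target) triples that differ by more than
    `tolerance` (expressed as a fraction of the source count)."""
    mismatches = []
    for entity, src in source_counts.items():
        tgt = target_counts.get(entity, 0)
        if abs(src - tgt) > src * tolerance:
            mismatches.append((entity, src, tgt))
    return mismatches

source = {"product": 1200, "customer": 480}
target = {"product": 1200, "customer": 479}
print(validate_counts(source, target))                  # [('customer', 480, 479)]
print(validate_counts(source, target, tolerance=0.01))  # []
```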

Last but not least, always have a plan B ready: run both systems in parallel whenever possible, the source in production and the target in a test environment. If everything looks good, do a partial migration (not again!) and slowly decommission the source system.

RDBMS to MongoDB migration


So far we have talked about general data migration experiences. A lot of new application development chooses noSQL over RDBMS for obvious reasons: flexibility in schema definition, high availability, faster access, and scaling. A lot of legacy applications are also moving to noSQL to take full advantage of its functionality. The biggest hurdle is migrating the traditional RDBMS data.

            Plan-Migrate-Validate Cycle


                By the time you decide to move to MongoDB you realize it's a paradigm shift that requires a different thought process to design the schema or object model.

It's always good practice to start small and complete the plan-migrate-validate cycle. Let's take the example of a small e-commerce site with basic functionality: product catalog, customer management, shopping cart, and order management. Designing the whole system's schema in one shot puts you at risk of disaster: the sheer volume of data, the number of tasks to be performed, and the validation of the complete system would effectively force you to start everything from scratch, throw away the application, and write a new one based on MongoDB.

If we pick just the catalog management functionality, these are the typical tasks that will be part of the holistic migration approach:

1. Design the initial MongoDB schema for the product catalog

2. Identify the source data tables that most closely relate to the target product catalog schema

3. Identify transformation requirements

4. Identify application code modules to be refactored

5. Migrate the product catalog data from the source RDBMS to the destination

6. Write automated validation scripts to validate the migration

7. Test the read-only application functionality: list products, view product details, search products

8. Test the add-product, modify-product, and add-new-product-attribute functionality

9. Roll out this new part, replacing the old one.

Repeat the same cycle for all the other functionalities: customer management, cart management, and order management.

This not only ensures an undisrupted production system but also gives you time to retrospect and correct any pitfalls from the previous iteration.

Define target schema


                If you come from a typical RDBMS background, it is always tempting to replicate the source schema on the target because (1) it's easy and (2) you know the drill, but that will take you a hundred steps back from where you are now. You're moving to MongoDB to take full advantage of what it offers, so it's good practice to design the target schema first and then find the source tables that best fit it. A few things to consider while designing the target schema:

1. What functionality are you offering through the application?

2. How often will the data be fetched?

3. What data is always required collectively?
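To make the three questions concrete, here is a hypothetical product document shaped around how a catalog page reads the data rather than around the source tables. Every field name is illustrative:

```python
# A hypothetical product document for the catalog. Data the page always
# needs together lives in one document (question 3); rarely-fetched or
# shared data is referenced instead (question 2).
product = {
    "_id": "SKU-1001",
    "name": "Ballpoint pen",
    "price": "2.50",                                 # Decimal kept as a string
    "categories": ["stationery", "office"],
    "attributes": {"color": "blue", "ink": "gel"},   # formerly a key/value table
    "images": ["img-77", "img-78"],                  # references into a files collection
}

# Everything the product page needs arrives in a single read:
print(product["name"], product["attributes"]["color"])
```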

            BLOB Migration


Sometimes you store files as BLOBs in RDBMS systems, and migrating these files as-is to MongoDB might not always be possible. MongoDB has a document size limit of 16 MB. It's not that you cannot store files bigger than 16 MB: you have to use GridFS, where MongoDB splits the file into multiple chunks and stores each chunk separately. When you read the file, MongoDB picks up each chunk and returns the complete file. At the same time, you can access and update one chunk at a time.

Files in general are not first-class objects; they are either part of a container object, say a directory/folder, or sub-objects of a first-class object, like an attachment to an e-mail or a comment.

So, for files that should be accessed as a whole (an image, a Word document, ...), it makes sense to create a separate collection of such files and keep a reference to each one in its container object. You don't want to download all the file contents in one go; you probably want to list all the files first and then let the user pick which files to download.
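To illustrate the chunking idea without a live database, the sketch below mimics what GridFS does: split a blob into ordered chunk records and reassemble them on read. The 255 KB figure is GridFS's default chunk size; the record shape here is a simplification, not the real GridFS schema.

```python
CHUNK_SIZE = 255 * 1024  # GridFS's default chunk size

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a blob into ordered chunk records, GridFS-style."""
    return [
        {"n": i, "data": data[offset:offset + chunk_size]}
        for i, offset in enumerate(range(0, len(data), chunk_size))
    ]

def reassemble(chunks):
    """Reading the file back means concatenating the chunks in order."""
    return b"".join(c["data"] for c in sorted(chunks, key=lambda c: c["n"]))

blob = b"x" * (CHUNK_SIZE * 2 + 100)  # a little over two chunks
chunks = split_into_chunks(blob)
print(len(chunks))                    # 3
print(reassemble(chunks) == blob)     # True
```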

            References or embedded documents


The very existence of MongoDB rests on noSQL principles, so storing all the related information together in a document is always encouraged. But the real world doesn't always run on principles; there are always caveats. It all depends on how your application is going to access or modify the data.

A few suggestions:

1. If your application is read-intensive (you create an object and rarely modify it), keep as much as possible in a single document.

2. If part of a document is accessed by multiple collections and that part is modifiable, it makes sense to keep it in a collection of its own and reference it from the other collections.

3. MongoDB writes are atomic at the document level; even if you change a single attribute value, the entire document is rewritten. Keep this fact in mind when considering how often your documents will change.
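The two shapes these suggestions lead to can be sketched side by side. All names (`reviews`, `brand_id`, the Acme brand record) are hypothetical:

```python
# Embedded: reviews live inside the product, so one read fetches everything.
# Good for read-heavy access where the embedded part rarely changes.
product_embedded = {
    "_id": "SKU-1001",
    "name": "Ballpoint pen",
    "reviews": [{"user": "u1", "stars": 5}, {"user": "u2", "stars": 4}],
}

# Referenced: a shared, modifiable part (a brand used by many products)
# lives in its own collection and is linked by id.
brands = {"b9": {"_id": "b9", "name": "Acme"}}
product_referenced = {"_id": "SKU-1001", "name": "Ballpoint pen", "brand_id": "b9"}

# Resolving the reference costs a second lookup -- the price of avoiding
# duplicated, drifting copies of the brand inside every product document:
print(brands[product_referenced["brand_id"]]["name"])  # Acme
```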



