Introduction
Companion - a person or animal with whom one spends a lot of
time or with whom one travels.
This is not a typical user guide but a collection of experiences from working on multiple migration projects, mostly relational-to-NoSQL migrations. This companion will help you if you are planning to migrate any data source in general, and as a typical use case it walks through an RDBMS to MongoDB migration.
General migration guidelines
Every data migration is unique, yet all migrations share a few characteristics, and following certain rules makes them far more likely to succeed. We are not going to talk about the overall planning aspect, as that is too general; every project needs planning. What follows is specific to data migration.
Know your data
The most important thing is to know your data before you plan and decide on an implementation path. An implementation strategy planned without knowing the data can stall or completely derail the migration. Answers to the following questions give you deeper insight into your data.
A. How big is your data – knowing this answers questions like which technology to choose, how to architect the migration process so it scales, and what hardware is required.
B. Is all data required on the target – there is always a class of data that is temporary: work directories, work tables, logs, or data and schemas that were added due to technical constraints.
C. Data types – know all the data types used on the source side. This helps in understanding the options available on the target side and whether every source data type has a replacement there (a small inventory sketch follows this list).
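As a starting point, here is a minimal sketch of such an inventory, assuming a MySQL source and its standard information_schema views; the connection details and the "inventory" schema name are hypothetical, and other engines expose the same information through their own catalog views.

```python
# Sketch: inventory table sizes and data types on a MySQL source.
# Assumes the mysql-connector-python driver; adapt the catalog queries
# for Oracle, SQL Server, etc.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root",
                               password="secret", database="inventory")
cur = conn.cursor()

# How big is the data? Approximate on-disk size per table.
cur.execute("""
    SELECT table_name,
           ROUND((data_length + index_length) / 1024 / 1024, 1) AS size_mb
    FROM information_schema.tables
    WHERE table_schema = %s
""", ("inventory",))
for table, size_mb in cur.fetchall():
    print(f"{table}: {size_mb} MB")

# Which data types are in use? This feeds the type-mapping exercise later.
cur.execute("""
    SELECT DISTINCT data_type
    FROM information_schema.columns
    WHERE table_schema = %s
""", ("inventory",))
print("Data types in use:", sorted(t for (t,) in cur.fetchall()))

cur.close()
conn.close()
```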
Location-specific data
Location-specific data is data that represents a specific location: a file path, a URL, an IP address, or in some cases even domain users. Identifying it requires a deep dive into the application the data belongs to, but in return it gives you fine-grained control over the migration process. Knowing your location-specific data helps you define your data transformation strategy.
Embedded data
Many applications use a single column in an RDBMS table to store a bunch of information together, typically in XML or JSON format. It is certainly not a best practice in the RDBMS world, but in some cases it makes life easier for the application developers, or makes the application perform a bit better. Knowing about such embedded data in advance makes it easier to architect the data transformation process of the migration.
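A minimal sketch of what handling such a column can look like, assuming the legacy application stored JSON strings; the row layout and the details_json column name are hypothetical:

```python
# Sketch: promote an embedded JSON string column to a real subdocument
# during migration, so it becomes queryable on the MongoDB side.
import json

def transform_row(row):
    """Turn a flat RDBMS row into a MongoDB document, unpacking the
    JSON blob the legacy application stored in one column."""
    doc = {
        "_id": row["id"],
        "name": row["name"],
    }
    raw = row.get("details_json")
    if raw:
        try:
            doc["details"] = json.loads(raw)   # becomes a queryable subdocument
        except ValueError:
            doc["details_raw"] = raw           # keep malformed payloads for review
    return doc

print(transform_row({"id": 1, "name": "kettle",
                     "details_json": '{"color": "red", "watts": 2000}'}))
```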
Encoding
You do not always have the luxury of perfectly formatted data. Having your data in Unicode makes migration much easier; if it is not, additional steps are needed to find the easiest possible migration path. Use the source database's capabilities to analyze the source character set and figure out the best solution. Oracle, for example, provides a character set scanner utility (CSSCAN) that produces a report you can use to plan the character set migration.
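When no vendor tool is available, even a crude scan helps. Here is a minimal sketch that flags values that do not decode cleanly as UTF-8; the sample rows are made up, and in practice the raw bytes would come from a query against the source tables:

```python
# Sketch: flag rows whose text bytes are not valid UTF-8, as a poor man's
# character-set scan when a tool like Oracle's CSSCAN is not available.

def find_bad_encoding(rows):
    """Yield primary keys of rows whose raw bytes fail UTF-8 decoding."""
    for pk, raw in rows:
        if raw is None:
            continue
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            yield pk

# In practice the rows come from the source database, fetched as raw bytes,
# e.g. cursor.execute("SELECT id, name FROM products").
sample = [(1, b"caf\xc3\xa9"),   # valid UTF-8 ("cafe" with accent)
          (2, b"caf\xe9")]      # latin-1 encoded, fails UTF-8
print(list(find_bad_encoding(sample)))   # -> [2]
```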
Define transformation strategy
Though birds have mastered migration (they have been doing it since the dawn of life on the planet), for us it has never been, and never will be, an easy ride. A lot of considerations come into play, and one of them is transforming the data to fit the target environment.
On the source side, knowing the answers to questions like "do I have location-specific data?" or "do I have embedded data?" helps define the data transformation strategy, that is, how deep the transformation has to go. Say you store the IP addresses of all the machines in your network, or the paths of image files, which may or may not be accessible in the target environment. You can either substitute default or empty values, or maintain a data mapping for each value if the value set is small.
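A minimal sketch of such a mapping, with a hypothetical lookup table and fallback behavior built from the analysis of the source environment:

```python
# Sketch: remap location-specific values (paths, hosts, IPs) while migrating.
# The mapping table below is hypothetical; it would be produced by the
# location-specific data analysis described earlier.
LOCATION_MAP = {
    "/mnt/legacy/images": "/data/images",
    "10.0.0.15": "app-server-01.internal",
}

def remap_location(value, default=None):
    """Rewrite a known source location to its target equivalent; fall back
    to a default (or the original value) when the mapping is unknown."""
    for src, dst in LOCATION_MAP.items():
        if value.startswith(src):
            return value.replace(src, dst, 1)
    return default if default is not None else value

print(remap_location("/mnt/legacy/images/p1.jpg"))  # -> /data/images/p1.jpg
print(remap_location("10.9.9.9", default=""))       # unknown -> empty value
```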
Data type transformation – most data sources, be they RDBMS or NoSQL, have similar kinds of data types, but in some cases you must transform them because the target environment does not have the exact data type or precision.
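A minimal sketch of a type-mapping step targeting MongoDB, assuming pymongo is installed (Decimal128 requires MongoDB 3.4+); the midnight convention for date-only values is one possible choice, not a rule:

```python
# Sketch: map RDBMS values onto BSON-friendly types before insertion.
# DECIMAL/NUMBER columns usually arrive as Python Decimal and should become
# Decimal128 to keep exact precision instead of lossy floats.
import datetime
from decimal import Decimal
from bson.decimal128 import Decimal128

def to_bson_value(value):
    if isinstance(value, Decimal):
        return Decimal128(value)            # exact numeric, no float rounding
    if isinstance(value, datetime.date) and not isinstance(value, datetime.datetime):
        # BSON has no date-only type; midnight is one common convention
        return datetime.datetime(value.year, value.month, value.day)
    return value                            # str, int, bytes, datetime pass through

print(to_bson_value(Decimal("19.99")))            # -> Decimal128('19.99')
print(to_bson_value(datetime.date(2015, 3, 1)))   # -> 2015-03-01 00:00:00
```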
Migration validation plan
Having passed through all the hurdles, you and your convoy reach the destination. Now the big questions are: have you arrived at the right place, have you arrived safely, is anyone missing, is everyone safe and sound, is there any damage to the convoy, and if so, what should be done to mitigate it?
Reaching the destination is only half the battle won; what comes afterwards is the tedious process of validation. With a considerable amount of data to validate, having some automation in place always makes your life easier.
Start from very basic validations and then drill into
specifics.
Basic validations are not very precise indicators of a valid data migration, but they get you to a comfort level from which you can start the serious validation.
Some validations to consider:
1. Size validation – the size of the source data marked for migration and the size on the target are close to each other (see the sketch after this list).
2. Sample data – pick some sample data from the migrated system and check the data type conversions and the location-specific data conversions.
3. If you have a sample application ready, connect it to the migrated data and see whether it behaves correctly. Start with read-only functionality, where the new application does not update or add any data; this ensures you are validating the migrated data and not running into the complication of mixing migrated data with data added or modified by the new target application.
4. Once that test passes, try adding or modifying data from the new application on the target environment; this is a real confidence booster.
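A minimal sketch of validations 1 and 2, assuming a MySQL source and pymongo; the table, collection, and connection details are hypothetical:

```python
# Sketch: basic count and spot-check validation between source and target.
import mysql.connector
from pymongo import MongoClient

sql = mysql.connector.connect(host="localhost", user="root",
                              password="secret", database="shop")
mongo = MongoClient("mongodb://localhost:27017")["shop"]

# 1. Counts: rows marked for migration vs. documents that arrived.
cur = sql.cursor()
cur.execute("SELECT COUNT(*) FROM products")
(source_count,) = cur.fetchone()
target_count = mongo.products.count_documents({})
assert source_count == target_count, f"count mismatch: {source_count} != {target_count}"

# 2. Sample data: re-read a handful of random rows and compare key fields.
cur.execute("SELECT id, name FROM products ORDER BY RAND() LIMIT 10")
for pk, name in cur.fetchall():
    doc = mongo.products.find_one({"_id": pk})
    assert doc is not None, f"missing document {pk}"
    assert doc["name"] == name, f"field mismatch on {pk}"

print("basic validation passed")
```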
Last but not least, always have a plan B ready: run both systems in parallel whenever possible, the source in production and the target in a test environment. If everything looks good, do a partial migration (not again!) and slowly decommission the source system.
RDBMS to MongoDB migration
So far we have talked about general data migration experiences. A lot of new application development chooses NoSQL over RDBMS for obvious reasons, be it flexibility in schema definition, high availability, faster access, or scaling. Many legacy applications are also moving towards NoSQL to take full advantage of this functionality, and the biggest hurdle is migrating the traditional RDBMS data.
Plan-Migrate-Validate Cycle
By the time you decide to move to MongoDB, you realize it is a paradigm shift and requires a different thought process to design the schema, or rather the object model.
It is always good practice to start small and complete the cycle of plan-migrate-validate. Take the example of a small e-commerce site with basic functionality: product catalog, customer management, shopping cart, and order management. Designing the whole system's schema in one go puts you at risk of disaster; the sheer volume of data, the number of tasks to be performed, and the validation of the complete system would effectively mean starting everything from scratch, throwing the application away, and writing a new one based on MongoDB.
If we just pick the catalog management functionality, these are the typical tasks in a holistic migration approach:
1. Design the initial MongoDB schema for the product catalog.
2. Identify the source data tables that relate most closely to the target product catalog schema.
3. Identify the transformation requirements.
4. Identify the application code modules to be refactored.
5. Migrate the product catalog data from the source RDBMS to the destination (a minimal sketch follows this list).
6. Write validation automation scripts to validate the migration.
7. Test the read-only application functionality: list products, view product details, search products.
8. Test the add-product, modify-product, and add-new-attribute functionality.
9. Roll out this new part, replacing the old one.
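A minimal sketch of the migration step (step 5), assuming a MySQL source and pymongo; the table, columns, and collection names are hypothetical, and a real run would plug in the transformation functions discussed earlier:

```python
# Sketch: move product rows from the source RDBMS into a MongoDB collection.
import mysql.connector
from pymongo import MongoClient

sql = mysql.connector.connect(host="localhost", user="root",
                              password="secret", database="shop")
catalog = MongoClient("mongodb://localhost:27017")["shop"]["products"]

cur = sql.cursor(dictionary=True)
cur.execute("SELECT id, name, price, category FROM products")

batch = []
for row in cur:
    batch.append({"_id": row["id"], "name": row["name"],
                  "price": float(row["price"]), "category": row["category"]})
    if len(batch) == 1000:          # insert in batches to limit round trips
        catalog.insert_many(batch)
        batch = []
if batch:
    catalog.insert_many(batch)
```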
Repeat the same cycle for all the other functionality: customer management, cart management, and order management. This not only keeps the production system undisrupted but also gives you time to retrospect and correct any pitfalls of the previous iteration.
Define target schema
If you come from a typical RDBMS background, it is always tempting to replicate the source schema on the target because 1. it is easy and 2. you know the drill, but that takes you a hundred steps back from where you are now. You are moving to MongoDB to take full advantage of what it offers, so it is good practice to design the target schema first and then find the source tables that best fit that schema. A few things to consider while designing the target schema:
1. What functionality are you offering through the application?
2. How often will the data be fetched?
3. What data is always required collectively?
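As an illustration, here is a hypothetical product document shaped by those three questions rather than by the source tables; every field name here is invented for the example:

```python
# Sketch: a product-catalog document designed around how the application
# reads the data. Fields that are always displayed together (attributes,
# image references) live inside the document itself.
product = {
    "_id": "SKU-1001",
    "name": "Steel kettle",
    "price": 19.99,
    "category": "kitchen",
    "attributes": {                 # was a separate attributes table on the source
        "color": "silver",
        "capacity_l": 1.7,
    },
    "images": [                     # references into a separate file store
        {"file_id": "img-front-1", "caption": "front"},
    ],
}
```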
BLOB Migration
Sometimes files are stored as BLOBs in RDBMS systems, and migrating these files as-is to MongoDB may not always be possible: MongoDB has a 16 MB document size limit. That is not to say you cannot store files bigger than 16 MB; you use GridFS for that, where MongoDB splits the file into multiple chunks and stores each chunk separately. When you read the file back, MongoDB fetches each chunk and returns the complete file, and you can also read the file a chunk at a time.
Files in general are not first-class objects; a file is usually either part of a container object, say a directory or folder, or a sub-object of a first-class object, like an attachment on an e-mail or a comment.
So, for files meant to be accessed as a whole (an image, a word document …), it makes sense to create a separate collection of such files and keep a reference to each file in its container object. You do not want to download all the file contents in one go; you probably want to list all the files first and let the user pick and choose which files to download.
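A minimal sketch of that pattern using pymongo's gridfs module: store the bytes in GridFS, keep only a reference in the container document, and stream the content when asked. The collection and field names are hypothetical:

```python
# Sketch: move a BLOB out of the RDBMS into GridFS and reference it from
# the owning document, so listings stay cheap (metadata only).
import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]
fs = gridfs.GridFS(db)

def attach_file(doc_id, filename, blob_bytes):
    """Store the file in GridFS and reference it from its container document."""
    file_id = fs.put(blob_bytes, filename=filename)
    db.products.update_one(
        {"_id": doc_id},
        {"$push": {"images": {"file_id": file_id, "filename": filename}}})
    return file_id

def download(file_id):
    """Fetch the full content only when the user actually asks for it."""
    return fs.get(file_id).read()
```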
References or embedded documents
MongoDB's very existence rests on NoSQL principles, so storing all related information together in a single document is always encouraged, but the real world does not always run on principles; there are caveats. It all depends on how your application is going to access and modify the data.
A few suggestions (a sketch of both patterns follows the list):
1. If your application is read-intensive, that is, you create an object and rarely modify it, keep as much as possible in a single document.
2. If part of a document is used by multiple collections and that part is modifiable, it makes sense to move it into a collection of its own and reference it from the other collections.
3. MongoDB writes are atomic at the document level; even if you change a single attribute value, the entire document is rewritten, so weigh this against how often your documents will change.
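A minimal sketch contrasting the two patterns with pymongo; all collection and field names are hypothetical:

```python
# Sketch: embedding vs. referencing in MongoDB.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]

# 1. Read-intensive, rarely modified: embed everything in one document,
#    so a single read returns all the data the page needs.
db.products.insert_one({
    "_id": "SKU-1001",
    "name": "Steel kettle",
    "reviews": [{"user": "a", "stars": 5}],   # always read with the product
})

# 2. Shared, frequently modified data: keep one copy and reference it,
#    so an update touches one small document instead of many large ones.
cat_id = db.categories.insert_one({"name": "kitchen", "vat_pct": 20}).inserted_id
db.products.update_one({"_id": "SKU-1001"}, {"$set": {"category_id": cat_id}})

# Resolving the reference costs a second query (or a $lookup aggregation):
product = db.products.find_one({"_id": "SKU-1001"})
category = db.categories.find_one({"_id": product["category_id"]})
```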