
MongoDB – Harmonizing the Data modelling

01 Jul, 2021

Moving from traditional relational database systems to document-oriented database systems (also known as NoSQL) is part of the transformation and modernization of applications. However, there is a risk of either over- or under-engineering while modelling the data for an application. This article describes optimal ways of modelling data for a document-oriented database system in order to achieve optimal performance.

Relational database systems organize data in the form of tables with rows and columns, whereas a document-oriented database organizes data as documents in JSON (JavaScript Object Notation) format.

From a data-organization perspective, the major difference between SQL and NoSQL database systems is that a relational database stores the data for a given object across multiple tables by applying normalization principles, whereas a NoSQL database stores all the information related to an object in a single document, without needing to look anything up in another document. This practice of keeping all the information about an object in a single document is often misapplied and can lead to performance issues. Care has to be taken to arrive at an optimized data model by following the principles described below.

Normalization vs De-Normalization

Neither principle (normalization or de-normalization) is inherently better than the other; which one to apply depends on the persona of the data. The trade-off between normalization and de-normalization lies in access throughput versus integrity of the data.

For example, consider the below document object:

Customer: {
    "_id": "cust_001",
    "Name": "Customer_One",
    "Address": "Address details",
    "Orders_List": [
        {
            "Type": "Sports",
            "SubType": "Tennis",
            "Item": "Racquet",
            "ItemSubType": "Wide-Band",
            "OrderStatus": "Active",
            "Quantity": 1,
            "DeliverySchedule": "2021-June-23",
            "DeliveryStatus": "In-Progress As Planned"
        }
    ]
}

In the above example, the Customer document, which embeds the order details as a sub-document (part of a list, a composite relation), is in a de-normalized state. This model is optimal:

  • To manage the order details of a specific customer without any further search
  • To achieve high throughput while accessing the data.

Now consider a scenario where all orders that are ‘Yet to be delivered’ and of type ‘Sports’ have to be displayed and/or managed (updated). One has to scan all the customer documents, identify those meeting the criteria, and only then manage them. For this scenario the de-normalized model is inefficient.
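To illustrate, here is a minimal query sketch, assuming the de-normalized documents live in a collection named customers and that ‘Delivered’ is the terminal delivery status (both are assumptions for illustration):

// Sketch only: find customers that have a pending 'Sports' order embedded in
// Orders_List. Without an index on these array fields this is a full collection
// scan, and updating the matching embedded orders requires positional array updates.
db.customers.find({
    "Orders_List.Type": "Sports",
    "Orders_List.DeliveryStatus": { $ne: "Delivered" }
})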

Re-engineering the data model by following the normalization principle gives the structure below:

Customer: {
    "Name": "Customer_One",
    "Address": "Address details",
    "Orders_List": [
        "Order_1"
    ]
}

Order: {
    "_id": "Order_1",
    "Type": "Sports",
    "SubType": "Tennis",
    "Item": "Racquet",
    "ItemSubType": "Wide-Band",
    "OrderStatus": "Active",
    "Quantity": 1,
    "DeliverySchedule": "2021-June-23",
    "DeliveryStatus": "In-Progress As Planned"
}

The documents of type ‘Customer’ and ‘Order’ are normalized and are associated by holding a reference to the Order inside the ‘Customer’ document. This data model is optimal for scenarios where (a query sketch follows the list):

  • The document of type ‘Order’ has its own life-cycle
  • References from other documents are needed
  • Data consistency matters
  • The model must stay flexible enough to extend the functionality for different personas of the data.
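With the normalized model, the earlier scenario no longer requires scanning every customer document. A minimal sketch, assuming the documents live in collections named customers and orders (the collection names and the ‘Delivered’ status value are assumptions):

// Pending sports orders can be queried directly on the orders collection.
db.orders.find({ Type: "Sports", DeliveryStatus: { $ne: "Delivered" } })

// When the owning customer is needed as well, the references held in Orders_List
// can be resolved with a $lookup aggregation.
db.customers.aggregate([
    { $lookup: { from: "orders", localField: "Orders_List",
                 foreignField: "_id", as: "Orders" } }
])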

Thus both normalization and de-normalization have their own advantages, and the choice depends largely on the persona of the data.

Whether to normalize or de-normalize a data model depends on factors such as:

  • Consistency of the data
  • Read frequency vs write frequency
  • The variety of scenarios in which the data is used
  • The significance of read-operation performance
  • Whether queries target the embedded document fields or the parent document fields
  • Growing documents – a document that grows continuously is not ideal for performance
  • Self-sufficiency of the document

Redundancy

A data model may contain redundant data when:

  • The same data is used in different scenarios.
  • Read performance is more important than write performance.

To typical relational data-model designers this appears awkward; however, well-designed and well-orchestrated write operations help improve read performance significantly.

Considering the same example, the data model can be re-engineered as:

Customer: {
    "_id": "cust_001",
    "Name": "Customer_One",
    "Address": "Address details",
    "Orders_List": [
        {
            "Type": "Sports",
            "SubType": "Tennis",
            "Item": "Racquet",
            "ItemSubType": "Wide-Band",
            "OrderStatus": "Active",
            "Quantity": 1,
            "DeliverySchedule": "2021-June-23",
            "DeliveryStatus": "In-Progress As Planned"
        }
    ]
}

Order: {
    "_id": "Order_1",
    "Type": "Sports",
    "SubType": "Tennis",
    "Item": "Racquet",
    "ItemSubType": "Wide-Band",
    "OrderStatus": "Active",
    "Quantity": 1,
    "DeliverySchedule": "2021-June-23",
    "DeliveryStatus": "In-Progress As Planned"
}

Here the document of type ‘Order’ is redundant; however, such a model enables:

  • High read performance
  • Availability of data for read operations in a variety of scenarios; i.e. the data is pre-populated and made available for each specific scenario.

However, a data model with redundant data requires multiple write operations, and a well-designed orchestration is needed to ensure these extra writes do not significantly impact the system.
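One way to orchestrate such writes is to wrap the updates to both copies in a multi-document transaction, which MongoDB supports on replica sets since version 4.0. A minimal shell sketch, assuming a database named shop with customers and orders collections (the names are assumptions):

// Sketch only: change the delivery status in the Order document and in the
// embedded copy inside the owning Customer document within one transaction.
const session = db.getMongo().startSession();
session.startTransaction();
try {
    const orders = session.getDatabase("shop").orders;
    const customers = session.getDatabase("shop").customers;
    orders.updateOne(
        { _id: "Order_1" },
        { $set: { DeliveryStatus: "Delivered" } }
    );
    customers.updateOne(
        { _id: "cust_001", "Orders_List.Item": "Racquet" },
        // The positional $ targets the embedded order matched in the filter.
        { $set: { "Orders_List.$.DeliveryStatus": "Delivered" } }
    );
    session.commitTransaction();
} catch (e) {
    session.abortTransaction();
    throw e;
} finally {
    session.endSession();
}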

Control vs Data

While modelling the data, the principle of keeping all the information related to an object in a single document often misleads designers into storing the information that controls the data alongside the actual data itself. For example, the document below describes an Order of type ‘Sports’:

Order:
{
    "Type": "Sports",
    "SubType": "Tennis",
    "Item": "Racquet",
    "ItemSubType": "Wide-Band",
    "Quantity": 1,
    "Role": {
        "Type": "OrderManagement",
        "SubType": "Sports"
    }
}

This document contains information related to an ‘Order’, but it also contains information about who can manage the document, indicated by the sub-document ‘Role’. The ‘Role’ information exists to control the actual document. This approach has drawbacks:

  • The control information is kept as part of the document even though it says nothing about the ‘Order’ itself
  • To search for documents based on control information, the number of documents to be scanned grows in proportion to the total number of documents in the system
  • As the system scales up, this can lead to performance issues, because the set of documents to be scanned keeps growing.

An optimal model for this scenario looks as follows:

Order:
{
    "Type": "Sports",
    "SubType": "Tennis",
    "Item": "Racquet",
    "ItemSubType": "Wide-Band",
    "Quantity": 1
}

RoleMapping:
{
    "Type": "OrderManagement",
    "SubType": "Sports",
    "OrderType": {
        "Type": "Sports",
        "SubType": "Tennis"
    }
}

Here the model is normalized by keeping the information that controls the document in a separate document. Using the database’s indexing feature, all order documents of type ‘Sports’ and ‘Tennis’ can be fetched for a user role of type ‘OrderManagement’ and ‘Sports’; such a direct, index-backed search performs far better than the earlier data model.
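A minimal sketch of the supporting index and the two-step lookup, assuming the documents live in collections named orders and rolemappings (the collection names are assumptions):

// A compound index lets the order lookup use an index scan instead of a
// full collection scan.
db.orders.createIndex({ Type: 1, SubType: 1 })

// Resolve which order types the 'OrderManagement' / 'Sports' role may manage...
const mapping = db.rolemappings.findOne({ Type: "OrderManagement", SubType: "Sports" })

// ...then fetch exactly those orders through the index.
db.orders.find({ Type: mapping.OrderType.Type, SubType: mapping.OrderType.SubType })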

In this scenario, the essence of the modelling is that a document should hold only the data that characterizes the object itself.

Sub-Documents and Arrays

MongoDB works best when arrays stay at a bounded, roughly constant size; an array that grows without limit hurts performance. Considering this, when embedding document objects inside another document, keeping them in a continuously growing array is not an optimal design choice; normalizing such documents into their own collection is the better option.
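A minimal sketch of the normalized alternative: instead of pushing every new order into Customer.Orders_List, each order holds a reference to its customer. The CustomerId field and the collection names are assumptions for illustration:

// The customer document stays at a stable size; orders grow as separate documents.
db.orders.insertOne({
    _id: "Order_2",
    CustomerId: "cust_001",   // hypothetical back-reference to the owning customer
    Type: "Sports",
    SubType: "Tennis",
    Item: "Racquet",
    Quantity: 1
})

// All orders of a customer are then retrieved via the reference, backed by an index.
db.orders.createIndex({ CustomerId: 1 })
db.orders.find({ CustomerId: "cust_001" })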

Conclusion

  • Depending on the domain and the contextual situation of managing the data, different techniques can be deployed to model the data for optimal performance.
  • Separate the data and control parts of a document, even when they are related.
  • Apply normalization and/or de-normalization depending on the persona of the data.

For more insights, reach out to our team of MongoDB experts constantly engaged in improving management of various data systems. For more information, please visit – https://www.xebia.com/contact
