A Case Study on Complete Application and Database Infrastructure Assessment, Gap Analysis, and Optimization

Mar 27, 2023

Introduction:

The company urgently needed to meet customer expectations and to focus on SLAs that prioritize the most critical aspects of the service and align with its customers’ needs. The company also wanted to reduce costs and to gain the ability to quickly launch a DR site and conduct DR drills as needed.

Company Overview:

The customer is a technology company and one of the largest behavioral health and rehabilitative EHR providers in the United States. Its EHR platform is used by behavioral health and human services organizations, those providing value-based care, and agencies with multiple programs and specialties.

The company’s EHR platform empowers its customers with robust clinical, administrative, and financial capabilities, including scheduling, intake, treatment planning, service documentation, consumer engagement, billing, analytics, and reporting.

Industry: SaaS solutions for behavioral health and human services organizations
Location: Tennessee, USA
Established: 2000
Employees: 800
Revenue: $170 Million

Challenges:

The company in this case study faced major EHR platform stability challenges due to outdated and sub-optimally configured network, application, database, and storage infrastructure. The platform’s sub-optimal performance and the resulting poor consumer experience led to declining customer satisfaction and loyalty. Consistently failing to meet SLAs also lowered employee morale and productivity. Lastly, the customer needed a DR strategy for business continuity.

Solutions Proposed:

The company initiated a project to tackle immediate application and infrastructure performance issues. Once the platform was stabilized, the strategic focus would shift to modernizing the application and the legacy systems and improving code quality, to reduce maintenance costs and ensure future scalability.

The company engaged Intuitive to analyze its ‘EHR platform’ and assess application performance, both overall and in specific functional areas. Intuitive provided the customer with complete program management, technical leadership, and expert resources to execute this assessment engagement seamlessly, in collaboration with the customer’s IT teams.

Leveraging its Application and Infrastructure Engineering & Modernization Practice, Intuitive planned a multi-phased approach, from forensic discovery of the “As-Built” on-prem multi-datacenter environment to gathering application performance data. Once all required data had been collected, Intuitive engineers delivered a detailed gap analysis of the findings and a thorough step-by-step plan to remediate the issues behind the application and infrastructure performance problems.

In-scope Application Infrastructure for detailed assessment, gap analysis and remediation:

  1. Application design and Application Infrastructure (App, Web, API Environments)
  2. Compute & Storage
  3. Databases (Oracle, MariaDB, MongoDB)
  4. Data Analytics & Streaming Platform
  5. Network & Security

Objectives and Deliverables:

Intuitive planned and executed the application ecosystem assessment in a phased approach, based on the technical and business requirements. Each group below lists the targeted objectives first, followed by the corresponding deliverables.

Optimize, Enhance:

Objectives:

  1. Inefficient application DBCP connection pooling
  2. Recurring appointment schedule functionality delay
  3. Appointment schedule submission functionality
  4. ECR document list function
  5. Claim processing engine job

Deliverables:

  1. Configure optimal DBCP attributes in the Hibernate configuration file
  2. Modify the group recurring procedure to run threads in parallel per user batch instead of processing all users in the group sequentially
  3. CIH segregation: implement a CDC process for sending scheduled appointments to CIH
  4. Tune all loops to avoid additional DB calls (read values from the Hibernate entity instead of making DB calls). Fix issues with Hibernate flush in legacy session management
  5. Modify the database PL/SQL procedure to sort the result set instead of sorting in application logic. Fix issues with Hibernate flush in legacy session management
  6. Modify the claim engine procedure to process organizations in parallel using multithreaded processing
  7. Modify the claim engine procedure to process activities or claims in parallel using multithreaded processing
  8. Increase the cursor record fetch limit to an optimal number to improve memory utilization and performance
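
Several items in the list above (the group recurring procedure and the claim engine procedure) replace sequential processing with parallel processing per batch. The pattern can be sketched as follows. This is a minimal Python illustration only: the platform in this case study is Java and PL/SQL, and the function names, the batch size, and the worker count are assumptions, not details from the engagement.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split_into_batches(items: Sequence[str], batch_size: int) -> List[Sequence[str]]:
    """Split items into consecutive batches of at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def schedule_batch(batch: Sequence[str]) -> List[str]:
    """Hypothetical stand-in for the per-batch work.

    In a real system this would be a database procedure call or similar
    I/O-bound operation, which is where threads give a benefit.
    """
    return [f"scheduled:{user}" for user in batch]


def process_group(users: Sequence[str], batch_size: int = 2, max_workers: int = 4) -> List[str]:
    """Process a group's users in parallel batches instead of one by one."""
    batches = split_into_batches(users, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() runs batches concurrently but yields results in batch order.
        batch_results = list(pool.map(schedule_batch, batches))
    # Flatten the per-batch results into a single list.
    return [item for result in batch_results for item in result]


if __name__ == "__main__":
    print(process_group(["u1", "u2", "u3", "u4", "u5"]))
```

The worker count would normally be bounded by the size of the database connection pool, since each worker holds a connection while its batch runs.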

Optimize, Enhance, Upgrade:

Objectives:

  1. Oracle PL/SQL procedures
  2. JDK version and dependent libraries
  3. App → DB connectivity
  4. JSP and JSF pages
  5. JS and CSS code
  6. EMAR screen loading performance
  7. Oracle DB transaction hanging issue

Deliverables:

  1. Implement the ORACLE_LOADER utility to process XML data for the five slowest-running PL/SQL procedures
  2. Modify the procedure flow to use parallel processing and reduce processing time
  3. Upgrade to JDK 17 with the latest dependent libraries and JAR files
  4. Implement pooling wherever possible, and upgrade the JDBC drivers or Oracle client software to the latest versions
  5. Upgrade JSF to a JSF 2.x version and use the newly added functions to improve overall response rendering
  6. Minify long CSS or JS code using tools such as YUI Compressor or Google Closure Compiler
  7. Use dynamic content loading techniques such as AJAX or iFrames in JSP, and remove unused or hidden DOM elements
  8. Evaluate redo log sizes and recreate the logs with larger sizes to avoid the hanging issue (Oracle DB checkpoint tuning)

Optimize, Enhance, Improve:

Objectives:

  1. User experience (degraded by slow Oracle query response times)
  2. User experience (degraded by inefficient use of MariaDB shared memory resources)
  3. Excessive pings to firewalls
  4. Heavy use of static routes on core and aggregation switches
  5. Storage infrastructure resources, reducing any overhead (Unity-380 compute optimization: better CPU utilization)
  6. I/O workloads across storage volumes

Deliverables:

  1. Multiple areas require further analysis and enhancement for query and batch job optimization:
    1) Reduce the overhead of SQL statements with a large version count
    2) Improve optimizer statistics collection processes
    3) Tune indexes
    4) Improve bad “cardinality estimates”
  2. Increase the buffer size: innodb_buffer_pool_size should be set to 70% of available RAM on the VMs
  3. Set innodb_buffer_pool_instances to a higher value on high-end machines (start with a value of 4)
  4. Disable or restrict ping on the firewall outside interface
  5. Use dynamic routing (OSPFv2) to avoid the administrative overhead of multiple static routes - OR -
  6. Replace static routes with dynamic routes (preferred) or add the correct next-hop IP
  7. Create gaps in the backup schedules for VM and filesystem backups
  8. Create two filesystems (instead of the current one) to separate VM backups from filesystem backups, and host them on separate SPs on the Unity array
  9. Distribute I/O workloads across more volumes
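
The innodb_buffer_pool_size guidance above is simple arithmetic on the RAM of each VM. A minimal Python sketch follows, assuming RAM is expressed in MiB and the result is rounded down to a whole MiB; the function names are hypothetical. The value for innodb_buffer_pool_instances is worth checking against the installed MariaDB version, since recent releases are understood to have deprecated that variable.

```python
def recommended_buffer_pool_mib(total_ram_mib: int, percent: int = 70) -> int:
    """Return the buffer pool size in MiB as a percentage of total RAM.

    Integer arithmetic is used so the result is rounded down predictably.
    """
    if total_ram_mib <= 0:
        raise ValueError("total_ram_mib must be positive")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    return total_ram_mib * percent // 100


def render_setting(total_ram_mib: int, percent: int = 70) -> str:
    """Render the value as an option-file line using the M size suffix."""
    size = recommended_buffer_pool_mib(total_ram_mib, percent)
    return f"innodb_buffer_pool_size = {size}M"


if __name__ == "__main__":
    # Hypothetical VM with 32 GiB of RAM (32 * 1024 MiB).
    print(render_setting(32 * 1024))
```

For a VM with 32 GiB of RAM this gives 22937 MiB, which leaves roughly 30% of memory for the operating system, connections, and other MariaDB buffers.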

Optimize, Enhance:

Objectives:

  1. Enhance standard maintenance procedures for database infrastructure for better efficiency
  2. Improve the MariaDB issue triage process
  3. Enhance the MongoDB ReplicaSet configuration (avoid extended downtime during failover, and reduce replication latency)
  4. Establish solid ‘Build and Standard Operating Procedures’ for database infrastructure
  5. Use scripts to manage ESXi host configuration settings efficiently
  6. Maximize efficiency by using templates for VM deployments

Deliverables:

  1. Enhance the stdout/stderr management process for OS, DB, and Grid infrastructure
  2. Optimize DB backups and scheduling
  3. Set slow_query_log to ‘ON’ and log_output to ‘TABLE’
  4. Optimize node priority configuration in the ReplicaSet: configure the relative eligibility of members to become primary
  5. Disable chained replication to reduce latency across nodes in the ReplicaSet
  6. Implement a monthly/quarterly agile patching process. Automate all such processes and mundane activities:
  7. Implement an agile patching process for OS, RDBMS, and Grid infrastructure
  8. Implement database DevOps practices (automation)
  9. Create and implement native vSphere scripts for ESXi host management
  10. Review VM configurations and create a Gold Template for different OS images
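
The two ReplicaSet items above correspond to two fields in a MongoDB replica set configuration document: members[n].priority and settings.chainingAllowed. The sketch below, in Python, shows the change on a plain dictionary. The host names and priority values are hypothetical, and applying the result to a live cluster (for example through a reconfiguration command) is deliberately left out.

```python
import copy
from typing import Dict


def tune_replica_set_config(config: dict, priorities: Dict[str, float]) -> dict:
    """Return a copy of a replica set config with member priorities set
    and chained replication disabled. The input is left unmodified."""
    tuned = copy.deepcopy(config)
    for member in tuned["members"]:
        # Members not listed in `priorities` keep their current priority.
        if member["host"] in priorities:
            member["priority"] = priorities[member["host"]]
    # With chaining disabled, secondaries replicate from the primary only.
    tuned.setdefault("settings", {})["chainingAllowed"] = False
    # A reconfiguration must carry a higher version than the current config.
    tuned["version"] = config["version"] + 1
    return tuned


# Hypothetical three-node replica set; hosts and priorities are illustrative.
current = {
    "_id": "rs0",
    "version": 1,
    "members": [
        {"_id": 0, "host": "db1.example.internal:27017", "priority": 1},
        {"_id": 1, "host": "db2.example.internal:27017", "priority": 1},
        {"_id": 2, "host": "db3.example.internal:27017", "priority": 1},
    ],
}

tuned = tune_replica_set_config(
    current,
    {"db1.example.internal:27017": 2, "db3.example.internal:27017": 0.5},
)
```

A higher priority makes a member more likely to be elected primary, so the preferred node in the primary datacenter would typically get the highest value.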

Optimize, Enhance:

Objectives:

  1. Improve the VM provisioning process
  2. Empower Active Directory to manage ESXi users
  3. Enhance vCenter and vSphere configuration for better manageability
  4. Manage VMware ESXi configuration with Host Profiles
  5. Enhance the VMware management solution, ensuring unified visibility

Deliverables:

  1. Automate the ESXi host imaging and provisioning process
  2. Enable ESXi hosts to leverage Active Directory by joining them to the domain
  3. Enhance vCenter alerting: configure the vCenter mail server to send critical alerts when an alarm is triggered
  4. Improve vCenter issue triage: change the logging level to “verbose”
  5. Configure vSphere Distributed Power Management effectively
  6. Implement the Proactive HA feature in VMware vCenter
  7. Configure the host isolation mechanism to handle management network connection losses in a vSphere HA cluster
  8. Use VMware Host Profiles to streamline configuration and ensure consistency across ESXi hosts
  9. Implement VMware Aria Operations for log management and analysis
  10. Enable Predictive DRS for proactive and intelligent virtual machine resource balancing

Improve, Enhance:

Objectives:

  1. Improve the efficiency of the overall storage infrastructure
  2. Implement better monitoring and alerting for the storage infrastructure
  3. Improve inefficient storage capacity management
  4. Implement better operational processes to avoid storage over-subscription
  5. Consider miscellaneous optimizations for better storage infrastructure efficiency and management

Deliverables:

  1. Implement the native data reduction features of the storage subsystem
  2. Implement real-time monitoring, integrated with reporting tools, using Syslog and SNMP configurations on the storage arrays
  3. Convert non-thin volumes to thin volumes
  4. The pool is currently 282% oversubscribed and the storage array has a pool of 51 TB, so capacity usage needs to be monitored and growth planned accordingly
  5. Configure specific bandwidth limits for non-prod hosts/volumes to minimize the impact of non-prod I/O on production hosts
  6. Enable the SNMP agent on the storage array to help manage and monitor storage effectively
  7. Configure Syslog on all storage devices
  8. Eliminate single points of failure so that appliances can fail over to each other if an issue occurs
  9. Configure host I/O limits for less critical hosts/workloads, including test/dev servers, to make sure they don’t impact production IOPS
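
The oversubscription figure above can be turned into capacity numbers with a short calculation. The Python sketch below assumes that "282% oversubscribed" means the capacity provisioned to hosts is 282% of the 51 TB pool; if the figure instead means 282% beyond the pool size, the results would be higher. The function names are hypothetical.

```python
def subscribed_capacity_tb(pool_tb: float, subscription_percent: float) -> float:
    """Capacity provisioned to hosts, given pool size and subscription ratio."""
    if pool_tb <= 0:
        raise ValueError("pool_tb must be positive")
    return pool_tb * subscription_percent / 100.0


def overcommitted_tb(pool_tb: float, subscription_percent: float) -> float:
    """Provisioned capacity that has no physical backing in the pool."""
    return max(0.0, subscribed_capacity_tb(pool_tb, subscription_percent) - pool_tb)


if __name__ == "__main__":
    pool = 51.0      # pool size in TB, from the assessment
    percent = 282.0  # subscription ratio, interpreted as described above
    print(f"Provisioned:   {subscribed_capacity_tb(pool, percent):.2f} TB")
    print(f"Overcommitted: {overcommitted_tb(pool, percent):.2f} TB")
```

Under that reading, roughly 143.82 TB is provisioned against 51 TB of physical capacity, which is why growth in actual consumption needs to be tracked closely.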

Secure, Enhance:

Objectives:

  1. Secure application code repository credentials
  2. Enhance compliance and audit processes in MariaDB
  3. Devices running older NX-OS/IOS code
  4. Restrict firewall policies
  5. Enhance network security and activate Lockdown Mode
  6. Deactivate the vSphere Managed Object Browser (MOB) to enhance security
  7. Enhance ESXi security by managing passwords and implementing account lockout
  8. Better maintain CIM interface security by restricting remote applications' access to the minimum
  9. Prevent unauthorized access to ESXi hosts

Deliverables:

  1. Implement secure key vault services such as HashiCorp Vault, SecretHub, or git-secret to store application credentials
  2. Also add QUERY_DCL to audit user configuration and management (grants, adding users, revokes, etc.)
  3. Upgrade NX-OS/IOS to the latest Cisco Safe Harbor image
  4. Configure rules with specific services or service groups
  5. Configure rules with source and destination restricted to specific IPs/subnets
  6. Enable Strict Lockdown Mode to restrict access
  7. Deactivate the MOB: it poses a security risk, as attackers might exploit it for malicious configuration changes
  8. Modify the ESXi hosts' predefined password requirements for enhanced security by adjusting the required length or character classes, or by enabling passphrases
  9. Control access for CIM-based hardware monitoring tools
  10. Configure timeout settings for idle ESXi Shell sessions

Results and Impact:

The detailed assessment, gap analysis and remediation are expected to result in several measurable benefits:

  1. Performance gain: 10-15% faster transactions
  2. Better user experience: up to 30% better response time for specific application functionality
  3. Increased application availability: 25% fewer critical incidents
  4. Faster claims processing due to 40% shorter job duration
  5. 10-15% increase in webpage responsiveness
  6. Enhanced user experience: 10-15% improvement in database call response times
  7. Faster response times, reduced latency, and better overall application performance
  8. More secure JDK and JDBC drivers, helping to maintain compliance
  9. 15-20% improved response time and better overall application performance
  10. Operate continuously and without downtime for a long period of time
  11. Enhanced database performance, ensuring smooth user experience - Performance gain of 20-30%
  12. Improved application performance and scalability
  13. Increased Infrastructure uptime & resiliency
  14. 20-30% improved Application Infrastructure Availability
  15. Better application response time
  16. Improved capacity management & utilization
  17. Better consumer experience
  18. Faster issues resolution, enhanced operational efficiency, improved user experience
  19. Well protected, continued product innovations with improved features
  20. Operate continuously, avoid outages
  21. Apps to remain responsive even in the face of unexpected failures
  22. Well protected, continued product innovations with improved features
  23. Reduced maintenance overhead
  24. Increased efficiency and productivity; reduced errors
  25. Speed to market and less error-prone delivery
  26. VMs creation - hardened, patched, and properly configured operating system deployments
  27. Speed to the market
  28. Better Management of overall power consumption of ESXi hosts
  29. Increased Productivity - Real time alerts, Capacity planning, Compliance reporting, and better root cause analysis with Self-healing capabilities
  30. Drive cost savings and operational enhancements through improved efficiency of overall Storage Infrastructure
  31. Better capacity gains and better visibility into consumption
  32. Protected/Secured applications from common vulnerabilities
  33. Enhanced data security, access control, and compliance measures
  34. Overall, much improved Infrastructure Security