NoSQL databases provide a flexible and scalable option for storing and retrieving data. However, they can struggle with object-oriented programming paradigms such as inheritance, a fundamental concept in languages like Java. This article explores the impedance mismatch when dealing with inheritance in NoSQL databases. The Inheritance Challenge in NoSQL Databases The term “impedance mismatch” refers to the disconnect between the object-oriented world of programming languages like Java and NoSQL databases’ tabular, document-oriented, or graph-based structures. One area where this mismatch is particularly evident is in handling inheritance. In Java, inheritance allows you to create a hierarchy of classes, where a subclass inherits properties and behaviors from its parent class. This concept is deeply ingrained in Java programming and is often used to model real-world relationships. However, NoSQL databases have no joins, and the inheritance structure needs to be handled differently. Jakarta Persistence (JPA) and Inheritance Strategies Before diving into more advanced solutions, it’s worth mentioning that there are strategies to simulate inheritance in relational databases in the world of Jakarta Persistence (formerly known as JPA). These strategies include: JOINED inheritance strategy: In this approach, fields specific to a subclass are mapped to a separate table from the fields common to the parent class. A join operation is performed to instantiate the subclass when needed. SINGLE_TABLE inheritance strategy: This strategy uses a single table representing the entire class hierarchy. A discriminator column is used to differentiate between subclasses. TABLE_PER_CLASS inheritance strategy: Each concrete entity class in the hierarchy corresponds to its own table in the database. These strategies work well in relational databases but are not directly applicable to NoSQL databases, primarily because NoSQL databases do not support traditional joins. Live Code Session: Java SE, Eclipse JNoSQL, and MongoDB In this live code session, we will create a Java SE project using MongoDB as our NoSQL database. We’ll focus on managing game characters, specifically Mario and Sonic characters, using Eclipse JNoSQL. You can run MongoDB locally using Docker or in the cloud with MongoDB Atlas. We’ll start with the database setup and then proceed to the Java code implementation. Setting Up MongoDB Locally To run MongoDB locally, you can use Docker with the following command: Shell docker run -d --name mongodb-instance -p 27017:27017 mongo Alternatively, you can choose to execute it in the cloud by following the instructions provided by MongoDB Atlas. With the MongoDB database up and running, let’s create our Java project. Creating the Java Project We’ll create a Java SE project using Maven and the maven-archetype-quickstart archetype. 
This project will utilize the following technologies and dependencies: Jakarta CDI Jakarta JSONP Eclipse MicroProfile Eclipse JNoSQL database Maven Dependencies Add the following dependencies to your project’s pom.xml file: XML <dependencies> <dependency> <groupId>org.jboss.weld.se</groupId> <artifactId>weld-se-shaded</artifactId> <version>${weld.se.core.version}</version> <scope>compile</scope> </dependency> <dependency> <groupId>org.eclipse</groupId> <artifactId>yasson</artifactId> <version>3.0.3</version> <scope>compile</scope> </dependency> <dependency> <groupId>io.smallrye.config</groupId> <artifactId>smallrye-config-core</artifactId> <version>3.2.1</version> <scope>compile</scope> </dependency> <dependency> <groupId>org.eclipse.microprofile.config</groupId> <artifactId>microprofile-config-api</artifactId> <version>3.0.2</version> <scope>compile</scope> </dependency> <dependency> <groupId>org.eclipse.jnosql.databases</groupId> <artifactId>jnosql-mongodb</artifactId> <version>${jnosql.version}</version> </dependency> <dependency> <groupId>net.datafaker</groupId> <artifactId>datafaker</artifactId> <version>2.0.2</version> </dependency> </dependencies> Make sure to replace ${jnosql.version} with the appropriate version of Eclipse JNoSQL you intend to use. In the next section, we will proceed with implementing our Java code. Implementing Our Java Code Our GameCharacter class will serve as the parent class for all game characters and will hold the common attributes shared among them. We’ll use inheritance and discriminator columns to distinguish between Sonic’s and Mario’s characters. Here’s the initial definition of the GameCharacter class: Java @Entity @DiscriminatorColumn("type") @Inheritance public abstract class GameCharacter { @Id @Convert(UUIDConverter.class) protected UUID id; @Column protected String character; @Column protected String game; public abstract GameType getType(); } In this code: We annotate the class with @Entity to indicate that it is a persistent entity in our MongoDB database. We use @DiscriminatorColumn("type") to specify that a discriminator column named “type” will be used to differentiate between subclasses. @Inheritance indicates that this class is part of an inheritance hierarchy. The GameCharacter class has a unique identifier (id), attributes for character name (character) and game name (game), and an abstract method getType(), which its subclasses will implement to specify the character type. Specialization Classes: Sonic and Mario Now, let’s create the specialization classes for Sonic and Mario entities. These classes will extend the GameCharacter class and provide additional attributes specific to each character type. We’ll use @DiscriminatorValue to define the values the “type” discriminator column can take for each subclass. Java @Entity @DiscriminatorValue("SONIC") public class Sonic extends GameCharacter { @Column private String zone; @Override public GameType getType() { return GameType.SONIC; } } In the Sonic class: We annotate it with @Entity to indicate it’s a persistent entity. @DiscriminatorValue("SONIC") specifies that the “type” discriminator column will have the value “SONIC” for Sonic entities. We add an attribute zone-specific to Sonic characters. The getType() method returns GameType.SONIC, indicating that this is a Sonic character. 
Java @Entity @DiscriminatorValue("MARIO") public class Mario extends GameCharacter { @Column private String locations; @Override public GameType getType() { return GameType.MARIO; } } Similarly, in the Mario class: We annotate it with @Entity to indicate it’s a persistent entity. @DiscriminatorValue("MARIO") specifies that the “type” discriminator column will have the value “MARIO” for Mario entities. We add an attribute locations specific to Mario characters. The getType() method returns GameType.MARIO, indicating that this is a Mario character. With this modeling approach, you can easily distinguish between Sonic and Mario characters in your MongoDB database using the discriminator column “type.” We will create our first database integration with MongoDB using Eclipse JNoSQL. To simplify, we will generate data using the Data Faker library. Our Java application will insert Mario and Sonic characters into the database and perform basic operations. Application Code Here’s the main application code that generates and inserts data into the MongoDB database: Java public class App { public static void main(String[] args) { try (SeContainer container = SeContainerInitializer.newInstance().initialize()) { DocumentTemplate template = container.select(DocumentTemplate.class).get(); DataFaker faker = new DataFaker(); Mario mario = Mario.of(faker.generateMarioData()); Sonic sonic = Sonic.of(faker.generateSonicData()); // Insert Mario and Sonic characters into the database template.insert(List.of(mario, sonic)); // Count the total number of GameCharacter documents long count = template.count(GameCharacter.class); System.out.println("Total of GameCharacter: " + count); // Find all Mario characters in the database List<Mario> marioCharacters = template.select(Mario.class).getResultList(); System.out.println("Find all Mario characters: " + marioCharacters); // Find all Sonic characters in the database List<Sonic> sonicCharacters = template.select(Sonic.class).getResultList(); System.out.println("Find all Sonic characters: " + sonicCharacters); } } } In this code: We use the SeContainer to manage our CDI container and initialize the DocumentTemplate from Eclipse JNoSQL. We create instances of Mario and Sonic characters using data generated by the DataFaker class. We insert these characters into the MongoDB database using the template.insert() method. We count the total number of GameCharacter documents in the database. We retrieve and display all Mario and Sonic characters from the database. Resulting Database Structure As a result of running this code, you will see data in your MongoDB database similar to the following structure: JSON [ { "_id": "39b8901c-669c-49db-ac42-c1cabdcbb6ed", "character": "Bowser", "game": "Super Mario Bros.", "locations": "Mount Volbono", "type": "MARIO" }, { "_id": "f60e1ada-bfd9-4da7-8228-6a7f870e3dc8", "character": "Perfect Chaos", "game": "Sonic Rivals 2", "type": "SONIC", "zone": "Emerald Hill Zone" } ] As shown in the database structure, each document contains a unique identifier (_id), character name (character), game name (game), and a discriminator column type to differentiate between Mario and Sonic characters. You will see more characters in your MongoDB database depending on your generated data. This integration demonstrates how to insert, count, and retrieve game characters using Eclipse JNoSQL and MongoDB. You can extend and enhance this application to manage and manipulate your game character data as needed. 
We will create repositories for managing game characters using Eclipse JNoSQL. We will have a Console repository for general game characters and a SonicRepository specifically for Sonic characters. These repositories will allow us to interact with the database and perform various operations easily. Let’s define the repositories for our game characters. Console Repository Java @Repository public interface Console extends PageableRepository<GameCharacter, UUID> { } The Console repository extends PageableRepository and is used for general game characters. It provides common CRUD operations and pagination support. Sonic Repository Java @Repository public interface SonicRepository extends PageableRepository<Sonic, UUID> { } The SonicRepository extends PageableRepository but is specifically designed for Sonic characters. It inherits common CRUD operations and pagination from the parent repository. Main Application Code Now, let’s modify our main application code to use these repositories. For Console Repository Java public static void main(String[] args) { Faker faker = new Faker(); try (SeContainer container = SeContainerInitializer.newInstance().initialize()) { Console repository = container.select(Console.class).get(); for (int index = 0; index < 5; index++) { Mario mario = Mario.of(faker); Sonic sonic = Sonic.of(faker); repository.saveAll(List.of(mario, sonic)); } long count = repository.count(); System.out.println("Total of GameCharacter: " + count); System.out.println("Find all game characters: " + repository.findAll().toList()); } System.exit(0); } In this code, we use the Console repository to save both Mario and Sonic characters, demonstrating its ability to manage general game characters. For Sonic Repository Java public static void main(String[] args) { Faker faker = new Faker(); try (SeContainer container = SeContainerInitializer.newInstance().initialize()) { SonicRepository repository = container.select(SonicRepository.class).get(); for (int index = 0; index < 5; index++) { Sonic sonic = Sonic.of(faker); repository.save(sonic); } long count = repository.count(); System.out.println("Total of Sonic characters: " + count); System.out.println("Find all Sonic characters: " + repository.findAll().toList()); } System.exit(0); } This code uses the SonicRepository to save Sonic characters specifically. It showcases how to work with a repository dedicated to a particular character type. With these repositories, you can easily manage, query, and filter game characters based on their type, simplifying the code and making it more organized. Conclusion In this article, we explored the seamless integration of MongoDB with Java using the Eclipse JNoSQL framework for efficient game character management. We delved into the intricacies of modeling game characters, addressing challenges related to inheritance in NoSQL databases while maintaining compatibility with Java's object-oriented principles. By employing discriminator columns, we could categorize characters and store them within the MongoDB database, creating a well-structured and extensible solution. Through our Java application, we demonstrated how to generate sample game character data using the Data Faker library and efficiently insert it into MongoDB. We performed essential operations, such as counting the number of game characters and retrieving specific character types. 
Moreover, we introduced the concept of repositories in Eclipse JNoSQL, showcasing their value in simplifying data management and enabling focused queries based on character types. This article provides a solid foundation for harnessing the power of Eclipse JNoSQL and MongoDB to streamline NoSQL database interactions in Java applications, making it easier to manage and manipulate diverse data sets. Source code
Distributed transactions, also known as XA transactions, emerged to address the complexity of coordinating transactions across multiple databases or systems in a distributed environment. Imagine you’re conducting an orchestra where each musician represents a different database or service. Just like ensuring harmony in music requires precise coordination, maintaining transactional integrity across distributed systems demands careful orchestration. This is where a two-phase commit (2PC), a vital aspect of XA transactions, steps in. 2PC acts as the conductor, ensuring that all musicians (or database participants) are ready to commit before the final note is played. Just as every instrument must be in tune before a symphony reaches its crescendo, 2PC ensures that all components of a distributed transaction are in sync before proceeding, thus guaranteeing the integrity of the transaction across the distributed landscape. Typical use cases include applications using more than one database system for the same transaction. In the Java ecosystem, a use case might be an Enterprise Application (i.e. an EAR deployed on an application server) using both JPA and JMS with JTA coordinating the distributed transaction. When it comes to MySQL being one of the systems participating, the standard flow would be: MySQL XA START <xid>; -- [... SQL Statements ...] XA END <xid>; XA PREPARE <xid>; XA COMMIT <xid>; <xid> is the transaction ID, a unique identifier generated by the transaction coordinator (i.e., JTA). When the Enterprise Application uses JPA with a persistence.xml configured to use JTA: XML <persistence-unit name="samplePU" transaction-type="JTA"> <!-- [...] --> </persistence-unit> …SQL statements get wrapped in XA transactions. Most of the time, the above flow works flawlessly. Until the application disconnects the session between PREPARE and COMMIT. This is when interesting things happen… Symptoms A transaction being stuck in the PREPARED state can go undetected for some time. As far as the application is concerned, the database was inaccessible so it might retry the transaction and succeed. But for MySQL, the transaction is still in the prepared state, waiting for a final verdict: commit or rollback. You can locate such transactions in the output of SHOW ENGINE INNODB STATUS: ---TRANSACTION 39898344, ACTIVE (PREPARED) 1314869 sec 4 lock struct(s), heap size 1128, 17 row lock(s), undo log entries 32 Ouch! More than 15 days ago. What’s worse is that you probably don’t casually check the Innodb status output: it’s the locks that will force you to investigate. Row locks or table locks will cause timeouts for no apparent reason. Then you start searching, probably using: SQL SELECT * FROM information_schema.innodb_trx; Aha! There has indeed been a transaction active for that long! Problem solved? Not yet. Here’s the surprise: TRX_MYSQL_THREAD_ID is 0. No thread running this transaction, no correlation with the running processes, and nothing to KILL to get rid of the transaction. You decide to bite the bullet and restart the server. Still no joy: the transaction is still there, waiting for the verdict. As annoying as it might be, it makes total sense. You would want the transaction to be there but don’t know how to resolve the situation. Yet… Recovery In order to commit or rollback the transaction, you only need the transaction’s ID. 
You can list all transactions in the PREPARED state by issuing: XA RECOVER; The output of this command is not exactly user-friendly, so you might want to try the following: XA RECOVER CONVERT XID; You need the XA_RECOVER_ADMIN privilege for this command, or you will get a not-very-helpful error message: SQL Error [1401] [XAE03]: XAER_RMERR: Fatal error occurred in the transaction branch - check your data for consistency This will get us the precious XID in hexadecimal. But still, this is not yet usable by XA COMMIT or XA ROLLBACK. For whatever reason, both commands expect the XID broken into three parts: xid: gtrid [, bqual [, formatID ]]: gtrid is a global transaction identifier, bqual is a branch qualifier, and formatID is a number that identifies the format used by the gtrid and bqual values. As indicated by the syntax, bqual and formatID are optional. The default bqual value is ’’ if not given. The default formatID value is 1 if not given. There is some string manipulation required: gtrid: It’s the first N bytes of the XID reported by XA RECOVER, where N is in the gtrid_length column of the same output. bqual: It’s the next M bytes of the XID reported by XA RECOVER, where M is in the bqual_length column of the same output. formatID: This is available in the formatID column. When you are done with slicing: XA COMMIT <gtrid> , <bqual> , <formatID> -- ...or... XA ROLLBACK <gtrid> , <bqual> , <formatID> Problem solved!
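To make the slicing concrete, here is a minimal sketch with a made-up prepared transaction; the XID, lengths, and hex values are illustrative assumptions, not output from the case above. MySQL accepts the hexadecimal form directly in XA COMMIT and XA ROLLBACK, and since each byte is two hex digits, gtrid_length and bqual_length tell you exactly where to cut:

SQL
XA RECOVER CONVERT XID;
-- Suppose the output row is (hypothetical values):
--   formatID: 1, gtrid_length: 4, bqual_length: 3, data: 0xAABBCCDD112233
-- gtrid    = first gtrid_length bytes = first 8 hex digits -> 0xAABBCCDD
-- bqual    = next bqual_length bytes  = next 6 hex digits  -> 0x112233
-- formatID = taken straight from the formatID column       -> 1
XA COMMIT 0xAABBCCDD, 0x112233, 1;
-- ...or, to discard the transaction instead:
-- XA ROLLBACK 0xAABBCCDD, 0x112233, 1;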
DevOps proposes Continuous Integration and Continuous Delivery (CI/CD) solutions for software project management. In CI/CD, the process of software development and operations falls into a cyclical feedback loop, which promotes not only innovation and improvements but also makes the product quickly adapt to the changing needs of the market. So, it becomes easy to cater to the needs of the customer and garner their satisfaction. The development team that adopts the culture becomes agile, flexible, and adaptive (called the Agile development team) in building an incremental quality product that focuses on continuous improvement and innovation. One of the key areas of CI/CD is to address changes. The evolution of software also has its effect on the database as well. Database change management primarily focuses on this aspect and can be a real hurdle in collaboration with DevOps practices which is advocating automation for CI/CD pipelines. Automating database change management enables the development team to stay agile by keeping database schema up to date as part of the delivery and deployment process. It helps to keep track of changes critical for debugging production problems. The purpose of this article is to highlight how database change management is an important part of implementing Continuous Delivery and recommends some processes that help streamline application code and database changes into a single delivery pipeline. Continuous Integration One of the core principles of an Agile development process is Continuous Integration. Continuous Integration emphasizes making sure that code developed by multiple members of the team is always integrated. It avoids the “integration hell” that used to be so common during the days when developers worked in their silos and waited until everyone was done with their pieces of work before attempting to integrate them. Continuous Integration involves independent build machines, automated builds, and automated tests. It promotes test-driven development and the practice of making frequent atomic commits to the baseline or master branch or trunk of the version control system. Figure 1: A typical Continuous Integration process The diagram above illustrates a typical Continuous Integration process. As soon as a developer checks in code to the source control system, it will trigger a build job configured in the Continuous Integration (CI) server. The CI Job will check out code from the version control system, execute a build, run a suite of tests, and deploy the generated artifacts (e.g., a JAR file) to an artifact repository. There may be timed CI jobs to deploy the code to the development environment, push details out to a static analysis tool, run system tests on the deployed code, or any automated process that the team feels is useful to ensure that the health of the codebase is always maintained. It is the responsibility of the Agile team to make sure that if there is any failure in any of the above-mentioned automated processes, it is promptly addressed and no further commits are made to the codebase until the automated build is fixed. Continuous Delivery Continuous Delivery takes the concept of Continuous Integration a couple of steps further. In addition to making sure that different modules of a software system are always integrated, it also makes sure that the code is always deployable (to production). 
This means that in addition to having an automated build and a completely automated test suite, there should be an automated process of delivering the software to production. Using the automated process, it should be possible to deploy software on short notice, typically within minutes, with the click of a button. Continuous Delivery is one of the core principles of DevOps and offers many benefits including predictable deploys, reduced risk while introducing new features, shorter feedback cycles with the customer, and overall higher quality of software. Figure 2: A typical Continuous Delivery process The above diagram shows a typical Continuous Delivery process. Please note that the above-illustrated Continuous Delivery process assumes that a Continuous Integration process is already in place. The above diagram shows 2 environments: e.g., User Acceptance Test (UAT) and production. However, different organizations may have multiple staging environments (Quality Assurance or QA, load testing, pre-production, etc.) before the software makes it to the production environment. However, it is the same codebase, and more precisely, the same version of the codebase that gets deployed to different environments. Deployment to all staging environments and the production environment are performed through the same automated process. There are many tools available to manage configurations (as code) and make sure that deploys are automatic (usually self-service), controlled, repeatable, reliable, auditable, and reversible (can be rolled back). It is beyond the scope of this article to go over those DevOps tools, but the point here is to stress the fact that there must be an automated process to release software to production on demand. Database Change Management Is the Bottleneck Agile practices are pretty much mainstream nowadays when it comes to developing application code. However, we don’t see as much adoption of agile principles and continuous integration in the area of database development. Almost all enterprise applications have a database involved and thus project deliverables would involve some database-related work in addition to application code development. Therefore, slowness in the process of delivering database-related work - for example, a schema change - slows down the delivery of an entire release. In this article, we would assume the database to be a relational database management system. The processes would be very different if the database involved is a non-relational database like a columnar database, document database, or a database storing data in key-value pairs or graphs. Let me illustrate this scenario with a real example: here is this team that practices Agile software development methodologies. They follow a particular type of Agile called Scrum, and they have a 2-week Sprint. One of the stories in the current sprint is the inclusion of a new field in the document that they interchange with a downstream system. The development team estimated that the story is worth only 1 point when it comes to the development of the code. It only involves minor changes in the data access layer to save the additional field and retrieve it later when a business event occurs and causes the application to send out a document to a downstream system. However, it requires the addition of a new column to an existing table. 
Had there been no database changes involved, the story could have been easily completed in the current sprint, but since there is a database change involved, the development team doesn’t think it is doable in this sprint. Why? Because a schema change request needs to be sent to the Database Administrators (DBA). The DBAs will take some time to prioritize this change request and rank this against other change requests that they received from other application development teams. Once the DBAs make the changes in the development database, they will let the developers know and wait for their feedback before they promote the changes to the QA environment and other staging environments, if applicable. Developers will test changes in their code against the new schema. Finally, the development team will closely coordinate with the DBAs while scheduling delivery of application changes and database changes to production. Figure 3: Manual or semi-automated process in delivering database changes Please note in the diagram above that the process is not triggered by a developer checking in code and constitutes a handoff between two teams. Even if the deployment process on the database side is automated, it is not integrated with the delivery pipeline of application code. The changes in the application code are directly dependent on the database changes, and they together constitute a release that delivers a useful feature to the customer. Without one change, the other change is not only useless but could potentially cause regression. However, the lifecycle of both of these changes is completely independent of each other. The fact that the database and codebase changes follow independent life cycles and the fact that there are handoffs and manual checkpoints involved, the Continuous Delivery process, in this example, is broken. Recommendations To Fix CI/CD for Database Changes In the following sections, we will explain how this can be fixed and how database-related work including data modeling and schema changes, etc., can be brought under the ambit of the Continuous Delivery process. DBAs Should Be a Part of the Cross-Functional Agile Team Many organizations have their DBAs split into broadly two different types of roles based on whether they help to build a database for application development teams or maintain production databases. The primary responsibility of a production DBA is to ensure the availability of production databases. They monitor the database, take care of upgrades and patches, allocate storage, perform backup and recovery, etc. A development DBA, on the other hand, works closely with the application development team and helps them come up with data model design, converts a logical data model into a physical database schema, estimates storage requirements, etc. To bring database work and application development work into one single delivery pipeline, it is almost necessary that the development DBA be a part of the development team. Full-stack developers in the development team with good knowledge of the database may also wear the hat of a development DBA. Database as Code It is not feasible to have database changes and application code integrated into a single delivery pipeline unless database changes are treated the same way as application code. This necessitates scripting every change in the database and having them version-controlled. It should then be possible to stand up a new instance of the database automatically from the scripts on demand. 
If we had to capture database objects as code, we would first need to classify them and evaluate each one of those types to see if and how they need to be captured as script (code). Following is a broad classification of them: Database Structure This is basically the definition of how stored data will be structured in the database and is also known as a schema. These include table definitions, views, constraints, indexes, and types. The data dictionary may also be considered as a part of the database structure. Stored Code These are very similar to application code, except that they are stored in the database and are executed by the database engine. They include stored procedures, functions, packages, triggers, etc. Reference Data These are usually a set of permissible values that are referenced from other tables that store business data. Ideally, tables representing reference data have very few records and don’t change much over the life cycle of an application. They may change when some business process changes but don’t usually change during the normal course of business. Application Data or Business Data These are the data that the application records during the normal course of business. The main purpose of any database system is to store these data. The other three types of database objects exist only to support these data. Out of the above four types of database objects, the first three can be and should be captured as scripts and stored in a version control system. Type Example Scripted (Stored like code?) Database Structure Schema Objects like Tables, Views, Constraints, Indexes, etc. Yes Stored Code Triggers, Procedures, Functions, Packages, etc. Yes Reference Data Codes, Lookup Tables, Static data, etc. Yes Business/Application Data Data generated from day-to-day business operations No Table 1: Depicts what types of database objects can be scripted and what types can’t be scripted As shown in the table above, business data or application data are the only types that won’t be scripted or stored as code. All rollbacks, revisions, archival, etc., are handled by the database itself; however, there is one exception. When a schema change forces data migration - say, for example, populating a new column or moving data from a base table to a normalized table - that migration script should be treated as code and should follow the same life cycle as the schema change. Let's take an example of a very simple data model to illustrate how scripts may be stored as code. This model is so simple and so often used in examples, that it may be considered the “Hello, World!” of data modeling. Figure 4: Example model with tables containing business data and ones containing reference data In the model above, a customer may be associated with zero or more addresses, like billing address, shipping address, etc. The table AddressType stores the different types of addresses like billing, shipping, residential, work, etc. The data stored in AddressType can be considered reference data as they are not supposed to grow during day-to-day business operations. On the other hand, the other tables contain business data. As the business finds more and more customers, the other tables will continue to grow. Example Scripts: Tables: Constraints: Reference Data: We won’t get into any more details and cover each type of database object. The purpose of the examples is to illustrate that all database objects, except for business data, can be and should be captured in SQL scripts. 
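The example scripts for tables, constraints, and reference data appeared as figures in the original article. The following is a minimal SQL sketch of what they might look like for the model in Figure 4; the exact column names, data types, and reference values are assumptions for illustration:

SQL
-- Tables (database structure)
CREATE TABLE AddressType (
    address_type_id INT PRIMARY KEY,
    description     VARCHAR(50) NOT NULL
);

CREATE TABLE Customer (
    customer_id INT PRIMARY KEY,
    first_name  VARCHAR(100) NOT NULL,
    last_name   VARCHAR(100) NOT NULL
);

CREATE TABLE Address (
    address_id      INT PRIMARY KEY,
    customer_id     INT NOT NULL,
    address_type_id INT NOT NULL,
    street          VARCHAR(200),
    city            VARCHAR(100)
);

-- Constraints (kept in their own script in this layout)
ALTER TABLE Address
    ADD CONSTRAINT fk_address_customer
        FOREIGN KEY (customer_id) REFERENCES Customer (customer_id);
ALTER TABLE Address
    ADD CONSTRAINT fk_address_address_type
        FOREIGN KEY (address_type_id) REFERENCES AddressType (address_type_id);

-- Reference data (a small, stable set of permissible values)
INSERT INTO AddressType (address_type_id, description) VALUES
    (1, 'Billing'), (2, 'Shipping'), (3, 'Residential'), (4, 'Work');

Each of these scripts lives in version control and follows the same life cycle as application code.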
Version Control Database Artifacts in the Same Repository as Application Code Keeping the database artifacts in the same repository of the version control system as the application code offers a lot of advantages. They can be tagged and released together since, in most cases, a change in database schema also involves a change in application code, and they together constitute a release. Having them together also reduces the possibility of application code and the database getting out of sync. Another advantage is just plain readability. It is easier for a new team member to come up to speed if everything related to a project is in a single place. Figure 5: Example structure of a Java Maven project containing database code The above screenshot shows how database scripts can be stored alongside application code. Our example is a Java application, structured as a Maven project. The concept is however agnostic of what technology is used to build the application. Even if it was a Ruby or a .NET application, we would store the database objects as scripts alongside application code to let CI/CD automation tools find them in one place and perform necessary operations on them like building the schema from scratch or generating migration scripts for a production deployment. Integrate Database Artifacts Into the Build Scripts It is important to include database scripts in the build process to ensure that database changes go hand in hand with application code in the same delivery pipeline. Database artifacts are usually SQL scripts of some form and all major build tools support executing SQL scripts either natively or via plugins. We won’t get into any specific build technology but will list down the tasks that the build would need to perform. Here we are talking about builds in local environments or CI servers. We will talk about builds in staging environments and production at a later stage. The typical tasks involved are: Drop Schema Create Schema Create Database Structure (or schema objects): They include tables, constraints, indexes, sequences, and synonyms. Deploy stored code, like procedures, functions, packages, etc. Load reference data Load TEST data If the build tool in question supports build phases, this will typically be in the phase before integration tests. This ensures that the database will be in a stable state with a known set of data loaded. There should be sufficient integration tests that will cause the build to fail if the application code goes out of sync with the data model. This ensures that the database is always integrated with the application code: the first step in achieving a Continuous Delivery model involving database change management. Figure 6: Screenshot of code snippet showing a Maven build for running database scripts The above screenshot illustrates the usage of a Maven plugin to run SQL scripts. It drops the schema, recreates it, and runs all the DDL scripts to create tables, constraints, indexes, sequences, and synonyms. Then it deploys all the stored code into the database and finally loads all reference data and test data. Refactor Data Model as Needed Agile methodology encourages evolutionary design over upfront design; however, many organizations that claim to be Agile shops, actually perform an upfront design when it comes to data modeling. There is a perception that schema changes are difficult to implement later in the game, and thus it is important to get it right the first time. 
If the recommendations made in the previous sections are followed, like having an integrated team with developers and DBAs, scripting database changes, and version controlling them alongside application code, it won’t be difficult to automate all schema changes. Once the deployment and rollback of database changes are fully automated and there is a suite of automated tests in place, it should be easy to mitigate risks in refactoring schema. Avoid Shared Database Having a database schema shared by more than one application is a bad idea, but such databases still exist. There is even a mention of a “Shared Database” as an integration pattern in a famous book on enterprise integration patterns, Enterprise Integration Patterns by Gregor Hohpe and Bobby Woolf. Any effort to bring application code and database changes under the same delivery pipeline won’t work unless the database truly belongs to the application and is not shared by other applications. However, this is not the only reason why a shared database should be avoided. "Shared Database" also causes tight coupling between applications and a multitude of other problems. Dedicated Schema for Every Committer and CI Server Developers should be able to work on their own sandboxes without the fear of breaking anything in a common environment like the development database instance; similarly, there should be a dedicated sandbox for the CI server as well. This follows the pattern of how application code is developed. A developer makes changes and runs the build locally, and if the build succeeds and all the tests pass, (s)he commits the changes. The sandbox could be either an independent database instance, typically installed locally on the developer’s machine, or it could be a different schema in a shared database instance. Figure 7: Developers make changes in their local environment and commit frequently As shown in the above diagram, each developer has their own copy of the schema. When a full build is performed, in addition to building the application, it also builds the database schema from scratch. It drops the schema, recreates it, and executes DDL scripts to load all schema objects like tables, views, sequences, constraints, and indexes. It creates objects representing stored code, like functions, procedures, packages, and triggers. Finally, it loads all the reference data and test data. Automated tests ensure that the application code and database objects are always in sync. It must be noted that data model changes are less frequent than application code changes, so the build script should have the option to skip the database build for the sake of build performance. The CI build job should also be set up to have its own sandbox of the database. The build script performs a full build that includes building the application as well as building the database schema from scratch. It runs a suite of automated tests to ensure that the application itself and the database that it interacts with are in sync. Figure 8: Revised CI process with integration of database build with build of application code Please note the similarity of the process described in the above diagram with the one described in Figure 1. The build machine or the CI server contains a build job that is triggered by any commit to the repository. The build that it performs includes both the application build and the database build. The database scripts are now always integrated, just like application code. 
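To make the full rebuild concrete, here is a minimal sketch of the script sequence such a sandbox build might execute. The schema name, file layout, and use of the MySQL/MariaDB client's SOURCE command are assumptions for illustration; the same steps could equally be wired into a Maven, Gradle, or shell build step:

SQL
-- 1. Drop and recreate the sandbox schema (one per developer, one for the CI job)
DROP SCHEMA IF EXISTS app_sandbox;
CREATE SCHEMA app_sandbox;
USE app_sandbox;
-- 2. Database structure: tables, views, sequences, constraints, indexes
SOURCE ddl/tables.sql
SOURCE ddl/constraints.sql
SOURCE ddl/indexes.sql
-- 3. Stored code: procedures, functions, packages, triggers
SOURCE code/procedures.sql
SOURCE code/triggers.sql
-- 4. Reference data, then test data for the integration tests
SOURCE data/reference_data.sql
SOURCE data/test_data.sql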
Dealing With Migrations The process described above would build the database schema objects, stored code, reference data, and test data from scratch. This is all good for continuous integration and local environments. This process won’t work for the production database and even QA or UAT environments. The real purpose of any database is storing business data, and every other database object exists only to support business data. Dropping schema and recreating it from scripts is not an option for a database currently running business transactions. In this case, there is a need for scripting deltas, i.e., the changes that will transition the database structure from a known state (a particular release of software) to a desired state. The transition will also include any data migration. Schema changes may lead to a requirement to migrate data as well. For example, as a result of normalization, data from one table may need to be migrated to one or more child tables. In such cases, a script that transforms data from the parent table to the children should also be a part of the migration scripts. Schema changes may be scripted and maintained in the source code repository so that they are part of the build. These scripts may be hand-coded during active development, but there are tools available to automate that process as well. One such tool is Flyway, which can generate migration scripts for the transition of one state of schema to another state. Figure 9: Automation of schema migrations and rollback In the above picture, the left-hand side shows the current state of the database which is in sync with the application release 1.0.1 (the previous release). The right-hand side shows the desired state of the database in the next release. We have the state on the left-hand side captured and tagged in the version control system. The right-hand side is also captured in the version control system as the baseline, master branch, or trunk. The difference between the right-hand side and the left-hand side is what needs to be applied to the database in the staging environments and the production environment. The differences may be manually tracked and scripted, which is laborious and error-prone. The above diagram illustrates that tools like Flyway can automate the creation of such differences in the form of migration scripts. The automated process will create the following: Migration script (to transition the database from the prior release to the new release) Rollback script (to transition the database back to the previous release). The generated scripts will be tagged and stored with other deploy artifacts. This automation process may be integrated with the Continuous Delivery process to ensure repeatable, reliable, and reversible (ability to rollback) database changes. Continuous Delivery With Database Changes Incorporated Into It Let us now put the pieces together. There is a Continuous Integration process already in place that rebuilds the database along with the application code. We have a process in place that generates migration scripts for the database. These generated migration scripts are a part of the deployment artifacts. The DevOps tools will use these released artifacts to build any of the staging environments or the production environment. The deployment artifacts will also contain rollback scripts to support self-service rollback. 
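Whether generated by a tool or hand-written during active development, the migration/rollback pair for a release typically boils down to a couple of small, versioned SQL files. The sketch below is illustrative only: the table comes from the earlier Customer/Address example, while the new column, the version number, and the Flyway-style file names are assumptions:

SQL
-- V1_0_2__add_preferred_channel.sql (migration: prior release -> new release)
ALTER TABLE Customer ADD COLUMN preferred_channel VARCHAR(20);
UPDATE Customer SET preferred_channel = 'EMAIL' WHERE preferred_channel IS NULL;

-- U1_0_2__add_preferred_channel.sql (rollback: new release -> prior release)
ALTER TABLE Customer DROP COLUMN preferred_channel;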
If anything goes wrong, the previous version of the application may then be redeployed and the database rollback script shall be run to transition the database schema to the previous state that is in sync with the previous release of the application code. Figure 10: Continuous Delivery incorporating database changes The above diagram depicts a Continuous Delivery process that has database change management incorporated into it. This assumes that a Continuous Integration process is already there in place. When a UAT (or any other staging environment like TEST, QA, etc.) deployment is initiated, the automated processes take care of creating a tag in the source control repository, building application deployable artifacts from the tagged codebase, generating database migration scripts, assembling the artifacts and deploying. The deployment process includes the deployment of the application as well as applying migration scripts to the database. The same artifacts will be used to deploy the application to the production environment, following the approval process. A rollback would involve redeploying the previous release of the application and running the database rollback script. Tools Available in the Market The previous sections primarily describe how to achieve CI/CD in a project that involves database changes by following some processes but don’t particularly take into consideration any tools that help in achieving them. The above recommendations are independent of any particular tool. A homegrown solution can be developed using common automation tools like Maven or Gradle for build automation, Jenkins or TravisCI for Continuous Integration, and Chef or Puppet for configuration management; however, there are many tools available in the marketplace, that specifically deal with Database DevOps. Those tools may also be taken advantage of. Some examples are: Datical Redgate Liquibase Flyway Conclusion Continuous Integration and Continuous Delivery processes offer tremendous benefits to organizations, like accelerated time to market, reliable releases, and overall higher-quality software. Database change management is traditionally cautious and slow. In many cases, database changes involve manual processes and often cause a bottleneck to the Continuous Delivery process. The processes and best practices mentioned in this article, along with available tools in the market, should hopefully eliminate this bottleneck and help to bring database changes into the same delivery pipeline as application code.
In today's rapidly evolving digital landscape, marked by the ascendancy of Artificial Intelligence (AI) and the ubiquity of cloud computing, the importance of database security has never been more pronounced. As databases increasingly become the backbone of AI algorithms and cloud-based services, they amass vast amounts of sensitive information, making them prime targets for cyberattacks. The convergence of these technologies not only amplifies the potential risks but also complicates the security dynamics, necessitating a more vigilant and sophisticated approach to safeguarding data. What Is Database Security? Database security is the practice of protecting and securing data from unauthorized access, corruption, or theft throughout its lifecycle. It encompasses a range of measures designed to safeguard databases, which are critical repositories of sensitive information in an organization. Effective database security strategies not only protect data integrity and privacy but also ensure regulatory compliance and maintain organizational reputation. As databases are the center of many data ecosystems, database security can encompass everything from network protocols, application access control lists, to firewalls. Security shouldn’t just stop or isolate to the database tier when developing a database security plan. Database Security Best Practices Implementing best practices is fundamental in safeguarding databases and ensures the protection of critical data systems for any organization. It is essential to have not only robust tools to automate monitoring and management, but to have regular reviews of database systems. Best Practices Best practices should include the following: Regular audits and monitoring, along with routine audits to track database activities and identify anomalies Encrypting data at rest and in transit to prevent unauthorized access Implementing strict access control policies, including role-based access and least required privileges Although many backup and recovery practices have been automated in an organization's cloud journey, ensuring there is an enterprise-level backup of all databases and having a robust recovery plan is essential. Another area that is often overlooked in the day of automation with cloud: ensuring that all systems are regularly updating and patching database software whenever required to protect from vulnerabilities Performing physical server security for all hosted databases and ensuring the hosts, including cloud services and credentials are securely protected Learn more about attribute-based access control with Spring Security. One of the biggest challenges in collecting the list of tools below and categorizing them is due to the feature-rich capabilities of a tool — there may be considerable overlap. Although a tool may be in one category, it could belong to more than one. Always do deeper research to find out the full coverage of any tool listed below. External Threats Protecting a database from external threats is a multifaceted challenge that requires a combination of strong network security, system hardening, and vigilant monitoring. Several open-source tools can significantly bolster your defenses against such threats. Here are some key types of tools and examples in each category. 
Firewalls and Network Security Tools pfSense: A powerful firewall and router software package that is highly configurable and includes features like VPN, intrusion prevention, and more UFW (Uncomplicated Firewall): An easy-to-use interface for managing iptables, the default firewall tool on Linux, providing a simplified approach to configuring your firewall. Review more about firewall bypassing techniques. Intrusion Detection and Prevention Systems (IDS/IPS) Snort: An open-source network intrusion detection and prevention system that can identify a wide range of attacks, including attempts to breach database security Suricata: Another powerful open-source IDS/IPS capable of real-time intrusion detection, inline intrusion prevention, network security monitoring, and offline PCAP processing Vulnerability Scanners OpenVAS (Greenbone Vulnerability Management): A comprehensive vulnerability scanning and vulnerability management solution Nmap: A network scanning tool that is one of the oldest and most trusted, which can be used to discover hosts and services on a network, thus providing insights into potential vulnerabilities Encryption Tools Let's Encrypt: Provides free SSL/TLS certificates, ensuring that data transmitted to and from your database is encrypted GnuPG: Can be used to encrypt data before it is stored in the database OpenLDAP: An open-source implementation of the Lightweight Directory Access Protocol, used for implementing robust authentication mechanisms It's important to note that no single tool provides complete protection. A layered approach combining several of these tools, along with best practices in configuration, patch management, and access control, is necessary for robust database security. Regular updates, patches, and security audits are also crucial components of a comprehensive security strategy. Insider Threats Tracking risky privileges and detecting abnormal activity within a database are essential components of database security, particularly for mitigating the risk of insider threats and ensuring that only authorized users have access to sensitive data. Several open-source tools can assist in monitoring and managing database privileges and activities. Some examples of open-source and enterprise-ready insider threat tools are: Apache Metron: Metron integrates a variety of open-source big data technologies to offer a centralized tool for security monitoring and analysis. It can be used to monitor database activity and detect anomalies in real time. Osquery: Developed by Facebook, Osquery is an operating system instrumentation framework that exposes the OS as a high-performance relational database. It allows you to query your system as if it were a relational database, which can be used to monitor database processes and unusual activities. auditd: Part of the Linux Auditing System, auditd is a component that can be used to track system modifications, potentially catching unauthorized changes to database configurations or unauthorized access attempts. Prometheus and Grafana: Prometheus can be used for event monitoring and alerting. Combined with Grafana for analytics and monitoring, this toolset can track database performance metrics and alert you to anomalies. Fail2Ban: Although generally used for protecting servers from brute force attacks, Fail2Ban can also be configured to monitor logs for certain database systems and ban IPs that show malicious activity patterns. Lynis: This is a security auditing tool for Unix-based systems. 
Lynis performs extensive health scans and security audits to assess and improve security defense, including database configurations. MongoDB Atlas: For those using MongoDB, MongoDB Atlas provides built-in monitoring and alerting services that can help track access and activities, although it's more of a service than a traditional open-source tool. These tools can provide valuable insights into who is accessing your databases, what they are doing, and whether their behavior aligns with established patterns of normal activity. However, the effectiveness of these tools largely depends on proper configuration, regular updates, and integration into your broader database security strategy. Remember, tools are only as effective as the policies and practices that guide their use. Regular audits, user training, and a culture of security awareness are also key in mitigating risks associated with database privileges and activities. SQL Injection and Database Scanners Protecting databases from SQL Injection attacks is a critical aspect of database security. While there are numerous strategies to prevent these attacks, including proper coding practices and validation techniques, certain open-source tools can also significantly enhance your defenses. Here are some notable ones: OWASP ModSecurity: This is an open-source, cross-platform web application firewall (WAF) developed by the Open Web Application Security Project (OWASP). ModSecurity provides protection from a variety of attacks against web applications, including SQL Injection. It can be used with web servers like Apache, Nginx, and IIS. Read DZone’s coverage of how to run OWASP ZAP security tests in Azure DevOps. sqlmap: sqlmap is an open-source penetration testing tool that automates the process of detecting and exploiting SQL Injection flaws and taking over database servers. It comes with a powerful detection engine and many features for the ultimate penetration tester. SQLChop: An open-source SQL injection analysis tool, which can perform comprehensive inspections on database operations. NoSQLMap: Designed for auditing and automating the detection of vulnerabilities in NoSQL databases and web applications. libInjection: libInjection is a library that specializes in detecting SQL Injection vulnerabilities. Developers can use it to scan inputs and identify if they contain SQL Injection attacks. In recent years, libInjection has been bypassed by a number of attackers, and although it is still valuable as part of a suite of security tools, I debated whether it should be kept on the list. Remember, while these tools are helpful, they should be part of a broader security strategy that includes secure coding practices, regular updates and patches, and thorough testing. SQL Injection is often a result of flaws in application code, so developer awareness and secure coding practices are just as important as deploying the right tools. Data Breaches Tracking and preventing data breaches is a critical task in cybersecurity. Open-source tools can be particularly valuable in this regard due to their community-driven nature, which often leads to rapid updates and a wide range of features. Here are some of the best open-source tools for tracking data breaches and preventing them: OSSEC (Open Source Security): OSSEC is a scalable, multi-platform, open-source Host-Based Intrusion Detection System (HIDS). 
It performs log analysis, file integrity checking, policy monitoring, rootkit detection, real-time alerting, and active response, making it a comprehensive tool for breach detection. Wazuh: Wazuh is a fork of OSSEC and extends its capabilities with more advanced features like compliance monitoring, richer visualization, and integration with Elastic Stack. It's excellent for intrusion detection, vulnerability detection, and incident response and is cloud-ready. Elasticsearch, Logstash, and Kibana (ELK Stack): This stack is powerful for log analysis and monitoring. By collecting, analyzing, and visualizing data from various sources, including network traffic, server logs, and application logs, ELK Stack helps in detecting and analyzing data breaches. GRR Rapid Response: GRR is an incident response framework focused on remote live forensics. It provides rapid analysis and insights into potential breaches and is particularly useful in large networks with Google as one of the main contributors. Security Onion: This is a Linux distribution for intrusion detection, enterprise security monitoring, and log management. It includes a suite of tools like Snort, Suricata, Zeek, Wazuh, and many others, which are essential for detecting and analyzing data breaches and supports cloud. It's important to note that while these tools are powerful for detecting potential breaches and intrusions, preventing breaches also relies heavily on proactive measures like regular system updates, robust access controls, employee training, and adherence to security best practices. These tools should be integrated into a broader security strategy for maximum effectiveness. Additional Tips Securing a database involves a multi-layered approach: Strong authentication protocols: Using multifactor authentication to enhance access security Implementing firewalls: Deploying database firewalls to monitor and regulate incoming and outgoing traffic Regular vulnerability assessments: Periodically assessing the database for vulnerabilities and addressing them promptly Educating users: Training employees and users on security best practices and potential threats Conclusion In an era where data breaches are increasingly common, understanding and implementing database security best practices is not just a technical necessity but a business imperative. Leveraging the right tools and strategies can ensure the integrity, confidentiality, and availability of data, thereby safeguarding an organization's most valuable digital assets. As threats evolve, so should the approaches to database security, requiring ongoing vigilance, adaptation, and education.
Slow query times in large datasets are a common headache in database management. MariaDB ColumnStore offers a neat way out of this. It's a columnar storage engine that significantly speeds up data analytics. Typically, you can improve query performance in relational databases by adding appropriate indexes. However, maintaining indexes is hard, especially with ad-hoc queries where you don't know where indexes are going to be needed. ColumnStore eases this pain. It's as if you had an index on each column but without the hassle of creating and updating them. The price to pay? Well, inserts are not as fast as with InnoDB, so this is not the best option for operational/transactional databases but rather for analytical ones. Bulk inserts are very fast though. There's plenty of online documentation about ColumnStore, so I won't go through all the details on how it works or how to deploy it in production. Instead, in this article, I'll show you how to try MariaDB ColumnStore on your computer using Docker. Pre-Requisites You'll need: The mariadb command-line tool Docker Setting up MariaDB ColumnStore Run a container with MariaDB + ColumnStore: Shell docker run -d -p 3307:3306 -e PM1=mcs1 --hostname=mcs1 --name mcs1 mariadb/columnstore This command runs a new Docker container using the official ColumnStore image, with several specified options: docker run: Starts a new Docker container. -d: Runs the container in detached mode (in the background). -p 3307:3306: Maps port 3307 on the host (your computer) to port 3306 inside the container. This makes the database accessible on port 3307 on the host machine. -e PM1=mcs1: The environment variable PM1 specifies the primary database node (mcs1). --hostname=mcs1: Sets the hostname of the container to mcs1. --name mcs1: Names the container mcs1. mariadb/columnstore: Specifies the Docker image to use, in this case, an image for MariaDB with the ColumnStore storage engine. Provision ColumnStore: Shell docker exec -it mcs1 provision mcs1 The command docker exec is used to interact with a running Docker container. This is what each option does: docker exec: Executes a command in a running container. -it: This option ensures the command is run in interactive mode with a terminal. mcs1 (first occurrence): This is the name of the Docker container in which the command is to be executed. provision mcs1: This is the specific command being executed inside the container. provision is a script included in the Docker image that initializes and configures the MariaDB ColumnStore environment within the container. The argument mcs1 is passed to the provision command to specify the host for the MariaDB server within the Docker container. Connect to the MariaDB server using the default credentials defined in the MariaDB ColumnStore Docker image: Shell mariadb -h 127.0.0.1 -P 3307 -u admin -p'C0lumnStore!'
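If you'd rather verify the connection from application code than from the mariadb client, a minimal JDBC sketch might look like the one below. It assumes the MariaDB Connector/J driver (org.mariadb.jdbc) is on the classpath; the port and credentials match the Docker command above.
Java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ColumnStoreConnectionCheck {
    public static void main(String[] args) throws Exception {
        // Port 3307 and the admin credentials come from the docker run command above
        String url = "jdbc:mariadb://127.0.0.1:3307/";
        try (Connection conn = DriverManager.getConnection(url, "admin", "C0lumnStore!");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT VERSION()")) {
            if (rs.next()) {
                // Should print a version string that mentions MariaDB
                System.out.println("Connected to: " + rs.getString(1));
            }
        }
    }
}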
Check that ColumnStore is available as a storage engine by running the following SQL statement: SQL SHOW ENGINES; Setting up a Demo Database Create the operations database and its InnoDB tables: SQL CREATE DATABASE operations; CREATE TABLE operations.doctors( id SERIAL PRIMARY KEY, name VARCHAR(200) NOT NULL CHECK(TRIM(name) != '') ) ENGINE=InnoDB; CREATE TABLE operations.appointments( id SERIAL PRIMARY KEY, name VARCHAR(200) NOT NULL CHECK(TRIM(name) != ''), phone_number VARCHAR(15) NOT NULL CHECK(phone_number RLIKE '[0-9]+'), email VARCHAR(254) NOT NULL CHECK(TRIM(email) != ''), time DATETIME NOT NULL, reason ENUM('Consultation', 'Follow-up', 'Preventive', 'Chronic') NOT NULL, status ENUM ('Scheduled', 'Canceled', 'Completed', 'No Show'), doctor_id BIGINT UNSIGNED NOT NULL, CONSTRAINT fk_appointments_doctors FOREIGN KEY (doctor_id) REFERENCES doctors(id) ) ENGINE=InnoDB; Create the analytics database and its ColumnStore table: SQL CREATE DATABASE analytics; CREATE TABLE analytics.appointments( id BIGINT UNSIGNED NOT NULL, name VARCHAR(200) NOT NULL, phone_number VARCHAR(15) NOT NULL, email VARCHAR(254) NOT NULL, time DATETIME NOT NULL, reason VARCHAR(15) NOT NULL, status VARCHAR(10) NOT NULL, doctor_id BIGINT UNSIGNED NOT NULL ) ENGINE=ColumnStore; You can use the same database (or schema, they are synonyms in MariaDB) for both the InnoDB and ColumnStore tables if you prefer. Use a different name for the ColumnStore table if you opt for this alternative. Inserting Demo Data Insert a few doctors: SQL INSERT INTO operations.doctors(name) VALUES ("Maria"), ("John"), ("Jane"); Create a new file with the name test_data_insert.py with the following content: Python import random import os import subprocess from datetime import datetime, timedelta # Function to generate a random date within a given range def random_date(start, end): return start + timedelta(days=random.randint(0, int((end - start).days))) # Function to execute a given SQL command using MariaDB def execute_sql(sql): # Write the SQL command to a temporary file with open("temp.sql", "w") as file: file.write(sql) # Execute the SQL command using the MariaDB client subprocess.run(["mariadb", "-h", "127.0.0.1", "-P", "3307", "-u", "admin", "-pC0lumnStore!", "-e", "source temp.sql"]) # Remove the temporary file os.remove("temp.sql") print("Generating and inserting data...") # Total number of rows to be inserted total_rows = 4000000 # Number of rows to insert in each batch batch_size = 10000 # Possible values for the 'reason' column and their associated weights for random selection reasons = ["Consultation", "Follow-up", "Preventive", "Chronic"] reason_weights = [0.5, 0.15, 0.25, 0.1] # Possible values for the 'status' column and their associated weights for random selection statuses = ["Scheduled", "Canceled", "Completed", "No Show"] status_weights = [0.1, 0.15, 0.7, 0.05] # Possible values for the 'doctor_id' column and their associated weights for random selection doctors = [1, 2, 3] doctors_weights = [0.4, 0.35, 0.25] # List of patient names names = [f"Patient_{i}" for i in range(total_rows)] # Insert data in batches for batch_start in range(0, total_rows, batch_size): batch_values = [] # Generate data for each row in the batch for i in range(batch_start, min(batch_start + batch_size, total_rows)): name = names[i] phone_number = f"{random.randint(100, 999)}-{random.randint(100, 999)}-{random.randint(1000, 9999)}" email = f"patient_{i}@example.com" time = random_date(datetime(2023, 1, 1), datetime(2024, 1, 1)).strftime("%Y-%m-%d 
%H:%M:%S") reason = random.choices(reasons, reason_weights)[0] status = random.choices(statuses, status_weights)[0] doctor_id = random.choices(doctors, doctors_weights)[0] # Append the generated row to the batch batch_values.append(f"('{name}', '{phone_number}', '{email}', '{time}', '{reason}', '{status}', {doctor_id})") # SQL command to insert the batch of data into the 'appointments' table sql = "USE operations;\nINSERT INTO appointments (name, phone_number, email, time, reason, status, doctor_id) VALUES " + ", ".join(batch_values) + ";" # Execute the SQL command execute_sql(sql) # Print progress print(f"Inserted up to row {min(batch_start + batch_size, total_rows)}") print("Data insertion complete.") Insert 4 million appointments by running the Python script: python3 test_data_insert.py Populate the ColumnStore table by connecting to the database and running: SQL INSERT INTO analytics.appointments ( id, name, phone_number, email, time, reason, status, doctor_id ) SELECT appointments.id, appointments.name, appointments.phone_number, appointments.email, appointments.time, appointments.reason, appointments.status, appointments.doctor_id FROM operations.appointments; Run Cross-Engine SQL Queries MariaDB ColumnStore is designed to run in a cluster of multiple servers. It is there where you see massive performance gains in analytical queries. However, we can also see this in action with the single-node setup of this article. Run the following query and pay attention to the time it needs to complete (make sure it queries the operations database): SQL SELECT doctors.name, status, COUNT(*) AS count FROM operations.appointments -- use the InnoDB table JOIN doctors ON doctor_id = doctors.id WHERE status IN ( 'Scheduled', 'Canceled', 'Completed', 'No Show' ) GROUP BY doctors.name, status ORDER BY doctors.name, status; On my machine, it took around 3 seconds. Now modify the query to use the ColumnStore table instead (in the analytics database): SQL SELECT doctors.name, status, COUNT(*) AS count FROM analytics.appointments -- use the ColumnStore table JOIN doctors ON doctor_id = doctors.id WHERE status IN ( 'Scheduled', 'Canceled', 'Completed', 'No Show' ) GROUP BY doctors.name, status ORDER BY doctors.name, status; It takes less than a second. Of course, you can speed up the first query by adding an index in this simplistic example, but imagine the situation in which you have hundreds of tables—it will become harder and harder to manage indexes. ColumnStore removes this complexity.
Spring AI is a new project in the Spring ecosystem that streamlines the creation of AI applications in Java. By using Spring AI together with PostgreSQL pgvector, you can build generative AI applications that draw insights from your data. First, this article introduces you to the Spring AI ChatClient that uses the OpenAI GPT-4 model to generate recommendations based on user prompts. Next, the article shows how to deploy PostgreSQL with the PGVector extension and perform vector similarity searches using the Spring AI EmbeddingClient and Spring JdbcClient. Adding Spring AI Dependency Spring AI supports many large language model (LLM) providers, with each LLM having its own Spring AI dependency. Let's assume that you prefer working with OpenAI models and APIs. Then, you need to add the following dependency to a project: XML <dependency> <groupId>org.springframework.ai</groupId> <artifactId>spring-ai-openai-spring-boot-starter</artifactId> <version>{latest.version}</version> </dependency> Also, at the time of writing, Spring AI was in active development, with the framework artifacts being released in the Spring Milestone and/or Snapshot repositories. Thus, if you still can't find Spring AI on https://start.spring.io/, then add the repositories to the pom.xml file: XML <repositories> <repository> <id>spring-milestones</id> <name>Spring Milestones</name> <url>https://repo.spring.io/milestone</url> <snapshots> <enabled>false</enabled> </snapshots> </repository> <repository> <id>spring-snapshots</id> <name>Spring Snapshots</name> <url>https://repo.spring.io/snapshot</url> <releases> <enabled>false</enabled> </releases> </repository> </repositories> Setting Up OpenAI Module The OpenAI module comes with several configuration properties, allowing the management of connectivity-related settings and fine-tuning the behavior of OpenAI models. At a minimum, you need to provide your OpenAI API key, which will be used by Spring AI to access GPT and embedding models. Once the key is created, add it to the application.properties file: Properties files spring.ai.openai.api-key=sk-... Then, if necessary, you can select particular GPT and embedding models: Properties files spring.ai.openai.chat.model=gpt-4 spring.ai.openai.embedding.model=text-embedding-ada-002 Finally, you can test that the OpenAI module is configured properly by implementing a simple assistant with Spring AI's ChatClient: Java // Inject the ChatClient bean @Autowired private ChatClient aiClient; // Create a system message for ChatGPT explaining the task private static final SystemMessage SYSTEM_MESSAGE = new SystemMessage( """ You're an assistant who helps to find lodging in San Francisco. Suggest three options. Send back a JSON object in the format below. [{\"name\": \"<hotel name>\", \"description\": \"<hotel description>\", \"price\": <hotel price>}] Don't add any other text to the response. Don't add the new line or any other symbols to the response. Send back the raw JSON. 
"""); public void searchPlaces(String prompt) { // Create a Spring AI prompt with the system message and the user message Prompt chatPrompt = new Prompt(List.of(SYSTEM_MESSAGE, new UserMessage(prompt))); // Send the prompt to ChatGPT and get the response ChatResponse response = aiClient.generate(chatPrompt); // Get the raw JSON from the response and print it String rawJson = response.getGenerations().get(0).getContent(); System.out.println(rawJson); } For the sake of the experiment, if you pass the "I'd like to stay near the Golden Gate Bridge" prompt, then the searchPlaces the method might provide lodging recommendations as follows: JSON [ {"name": "Cavallo Point", "description": "Historic hotel offering refined rooms, some with views of the Golden Gate Bridge, plus a spa & dining.", "price": 450}, {"name": "Argonaut Hotel", "description": "Upscale, nautical-themed hotel offering Golden Gate Bridge views, plus a seafood restaurant.", "price": 300}, {"name": "Hotel Del Sol", "description": "Colorful, retro hotel with a pool, offering complimentary breakfast & an afternoon cookies reception.", "price": 200} ] Starting Postgres With PGVector If you run the previous code snippet with the ChatClient, you'll notice that it usually takes over 10 seconds for the OpenAI GPT model to generate a response. The model has a broad and deep knowledge base, and it takes time to produce a relevant response. Apart from the high latency, the GPT model might not have been trained on data that is relevant to your application workload. Thus, it might generate responses that are far from being satisfactory for the user. However, you can always expedite the search and provide users with accurate responses if you generate embeddings on a subset of your data and then let Postgres work with those embeddings. The pgvector extension allows storing and querying vector embeddings in Postgres. The easiest way to start with PGVector is by starting a Postgres instance with the extension in Docker: Shell mkdir ~/postgres-volume/ docker run --name postgres \ -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=password \ -p 5432:5432 \ -v ~/postgres-volume/:/var/lib/postgresql/data -d ankane/pgvector:latest Once started, you can connect to the container and enable the extension by executing the CREATE EXTENSION vector statement: Shell docker exec -it postgres psql -U postgres -c 'CREATE EXTENSION vector' Lastly, add the Postgres JDBC driver dependency to the pom.xml file: XML <dependency> <groupId>org.postgresql</groupId> <artifactId>postgresql</artifactId> <version>{latest.version}</version> </dependency> Configure the Spring DataSource by adding the following settings to the application.properties file: Properties files spring.datasource.url = jdbc:postgresql://127.0.0.1:5432/postgres spring.datasource.username = postgres spring.datasource.password = password Performing Vector Similarity Search With Spring AI At a minimum, the vector similarity search is a two-step process. First, you need to use an embedding model to generate a vector/embedding for a provided user prompt or other text. Spring AI supports the EmbeddingClient that connects to OpenAI's or other providers' embedding models and generates a vectorized representation for the text input: Java // Inject the Spring AI Embedding client @Autowired private EmbeddingClient aiClient; public List<Place> searchPlaces(String prompt) { // Use the Embedding client to generate a vector for the user prompt List<Double> promptEmbedding = aiClient.embed(prompt); ... 
} Second, you use the generated embedding to perform a similarity search across vectors stored in the Postgres database. For instance, you can use the Spring JdbcClient for this task: Java @Autowired private JdbcClient jdbcClient; // Inject the Spring AI Embedding client @Autowired private EmbeddingClient aiClient; public List<Place> searchPlaces(String prompt) { // Use the Embedding client to generate a vector for the user prompt List<Double> promptEmbedding = aiClient.embed(prompt); // Perform the vector similarity search StatementSpec query = jdbcClient.sql( "SELECT name, description, price " + "FROM airbnb_listing WHERE 1 - (description_embedding <=> :user_prompt::vector) > 0.7 " + "ORDER BY description_embedding <=> :user_prompt::vector LIMIT 3") .param("user_prompt", promptEmbedding.toString()); // Return the recommended places return query.query(Place.class).list(); } The description_embedding column stores embeddings that were pre-generated for Airbnb listing overviews from the description column. The Airbnb embeddings were produced by the same model that is used by Spring AI's EmbeddingClient for the user prompts. (A short sketch of how such a column might be populated follows at the end of this section.) Postgres uses PGVector to calculate the cosine distance (<=>) between the Airbnb and user prompt embeddings (description_embedding <=> :user_prompt::vector) and then returns only those Airbnb listings whose description similarity to the provided user prompt is greater than 0.7. The similarity is measured as a value in the range from 0 to 1. The closer the similarity is to 1, the more related the vectors are. What's Next Spring AI and PostgreSQL PGVector provide all the essential capabilities needed for building generative AI applications in Java. If you're curious to learn more, watch this hands-on tutorial. It guides you through the process of creating a lodging recommendation service in Java from scratch, optimizing similarity searches with specialized indexes, and scaling with distributed Postgres (YugabyteDB).
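As promised above, here is a rough sketch of how the description_embedding column could be populated in the first place, reusing the same EmbeddingClient and JdbcClient beans and continuing the same Spring service class as the snippets above. The update statement and the assumed column type are illustrations, not the article's code.
Java
// Continuing the same Spring service class as in the snippets above
@Autowired
private JdbcClient jdbcClient;

@Autowired
private EmbeddingClient aiClient;

// Generates an embedding for one listing description and stores it next to the row.
// Assumes description_embedding is a pgvector column sized for the chosen model,
// e.g. vector(1536) for text-embedding-ada-002.
public void embedListing(long id, String description) {
    List<Double> embedding = aiClient.embed(description);

    jdbcClient.sql(
            "UPDATE airbnb_listing " +
            "SET description_embedding = :embedding::vector " +
            "WHERE id = :id")
        .param("embedding", embedding.toString())
        .param("id", id)
        .update();
}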
SQL Studio is one of the core components of Magic. One of its features is that it allows you to visually design your database. This significantly simplifies software development since it allows you to focus on "the what" instead of "the how." To illustrate the importance of the above point, try to write down the SQL statement required to create a database table in SQLite with an auto-increment primary key and two text columns. Chances are you cannot do it without searching. Software development as an art form contains too many constructs to be able to fit all of them into your head, which is a large part of the reason why low-code, no-code, and AI so drastically improve your productivity. Below is how you create a new column in SQL Studio. In the following video, I walk you through SQL Studio's most important features, covering, among other things: Creating tables Creating columns Creating indexed columns Creating foreign keys Creating primary keys Automatically creating migration scripts Using AI to create SQL DDL scripts Saving SQL snippets for later Exporting data from your database to CSV files Etc... If you were to teach yourself the stuff I go through using a "drag'n'drop" type of user interface in the above video, you could easily spend years reading books and studying at the university level. I run through everything in 20 minutes, at which point you'll have a basic understanding of most concepts required to create and manage your own database afterward. It doesn't take a rocket scientist to understand the advantages here, I presume... SQL Editor In addition to the visual designer, SQL Studio also allows you to execute any SQL you wish. Below is a screenshot of the autocomplete features, giving you suggestions for tables and columns as you write your SQL. You can connect it to any of the following RDBMSs: MySQL PostgreSQL Microsoft SQL Server MariaDB SQLite The designer will automatically adjust its SQL dialect, while the SQL editor transparently gives you autocomplete and syntax highlighting. Combined with the ability to connect to a cloud database, this gives you an intuitive and pleasant environment to manage your database, such as those hosted with Amazon or Azure. Below is a screenshot of how you connect to an existing database. Storing SQL Snippets Magic also allows you to save frequently used SQL snippets. This can be a handy feature for frequently used SQL statements you need to execute periodically to create reports, for instance. Combined with the ability to export your result sets as CSV files, this allows you to rapidly create, for instance, reports that you export to Excel once every month, etc. DRY DRY is one of those acronyms we software developers seem to be obsessed with. It means "Don't Repeat Yourself." Some would argue that every time you repeat yourself, you're losing money because you're doing something that could be automated. Magic Cloud is all about becoming DRY; SQL Studio is one manifestation of that, eliminating repetitive tasks while reducing the cognitive complexity required to perform them. I once heard somebody explain automation as follows: When you work with automation, nothing happens for a very long time. Then everything happens almost instantly. Automation Yesterday, I was in a meeting with an automation project manager who was responsible for 3,600 retail stores. He told me they had a project where they were able to automate 10 minutes of manual work for each store manager per day. 
I wasn't particularly impressed initially, before I realized that 10 minutes multiplied by 3,600 stores becomes 36,000 minutes every single day. 36,000 minutes becomes 600 hours of saved time every single day. Multiplied by one year, we're looking at 219,000 hours. The average employee works 1,760 hours per year. This implies the automation expert was saving 124 full-time positions every single year. Multiplying that by an average salary cost of $30,000 annually becomes $3.7 million annually. 10 minutes per day became $3.7 million in savings Doing a basic ROI cost/effect analysis on the above implies the automation expert could, in theory, have hired 124 people working full-time on the problem for 12 months, and the ROI of the finished project would still be worth it. Of course, he didn't have 124 people; he probably only had a handful of people working on it for a month or something. However, this equation illustrates the importance of automation. The key to automation is to look for those places where you're doing repetitive work, as in "wet work," implying places where you are not DRY... Saving yourself 10 minutes of work per day might save your employer $3.7 million per year! And I'm willing to bet a kidney that if you were to write down the SQL required to create a table with an auto-increment primary key and two text fields for SQLite, you would easily spend 10 minutes Googling for it before you found the answer...
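For the record, the statement behind that bet is short once you know it. Below is a minimal sketch using the (assumed) sqlite-jdbc driver; the table and column names are hypothetical, picked only to satisfy the "auto-increment primary key plus two text columns" challenge.
Java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateSqliteTable {
    public static void main(String[] args) throws Exception {
        // Creates (or opens) a local SQLite database file
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:demo.db");
             Statement st = conn.createStatement()) {
            // Auto-increment primary key plus two text columns
            st.execute(
                "CREATE TABLE IF NOT EXISTS notes (" +
                "  id INTEGER PRIMARY KEY AUTOINCREMENT," +
                "  title TEXT NOT NULL," +
                "  body TEXT" +
                ")");
            System.out.println("Table created");
        }
    }
}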
In a previous blog, I demonstrated how to use Redis (ElastiCache Serverless as an example) as a chat history backend for a Streamlit app using LangChain. It was deployed to EKS and also made use of EKS Pod Identity to manage the application Pod permissions for invoking Amazon Bedrock. The use case here is a similar one: a chat application. I will switch back to implementing things in Go using langchaingo (I used Python for the previous one) and continue to use Amazon Bedrock. But there are a few unique things you can explore in this blog post: The chat application is deployed as an AWS Lambda function along with a Function URL. It uses DynamoDB as the chat history store (aka Memory) for each conversation - I extended langchaingo to include this feature. Thanks to the AWS Lambda Web Adapter, the application was built as a (good old) REST/HTTP API using a familiar library (in this case, Gin). And the other nice add-on was being able to combine the Lambda Web Adapter streaming response feature with the Amazon Bedrock streaming inference API. Deploy Using SAM CLI (Serverless Application Model) Make sure you have the Amazon Bedrock prerequisites taken care of and the SAM CLI installed. git clone https://github.com/abhirockzz/chatbot-bedrock-dynamodb-lambda-langchain cd chatbot-bedrock-dynamodb-lambda-langchain Run the following commands to build the function and deploy the entire app infrastructure (including the Lambda function, the DynamoDB table, etc.): sam build sam deploy -g Once deployed, you should see the Lambda Function URL in your terminal. Open it in a web browser and start conversing with the chatbot! Inspect the DynamoDB table to verify that the conversations are being stored (each conversation will end up being a new item in the table with a unique chat_id): aws dynamodb scan --table-name langchain_chat_history The Scan operation is used here for demonstration purposes only; using Scan in production is not recommended. Quick Peek at the Good Stuff Using DynamoDB as the backend chat history store: Refer to the GitHub repository if you are interested in the implementation. To summarize, I implemented the required functions of the schema.ChatMessageHistory interface. Lambda Web Adapter Streaming response + LangChain Streaming: I used the chains.WithStreamingFunc option with the chains.Call call and then let Gin Stream do the heavy lifting of handling the streaming response. Here is a sneak peek of the implementation (refer to the complete code here): _, err = chains.Call(c.Request.Context(), chain, map[string]any{"human_input": message}, chains.WithMaxTokens(8191), chains.WithStreamingFunc(func(ctx context.Context, chunk []byte) error { c.Stream(func(w io.Writer) bool { fmt.Fprintf(w, (string(chunk))) return false }) return nil })) Closing Thoughts I really like the extensibility of LangChain. While I understand that langchaingo may not be as popular as the original Python version (I hope it will get there in due time), it's nice to be able to use it as a foundation and build extensions as required. Previously, I had written about how to use the AWS Lambda Go Proxy API to run existing Go applications on AWS Lambda. The AWS Lambda Web Adapter offers similar functionality, but it has lots of other benefits, including response streaming and the fact that it is language agnostic. Oh, and one more thing - I also tried a different approach to building this solution using the API Gateway WebSocket API. Let me know if you're interested, and I would be happy to write it up! 
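The actual chat-history implementation is in Go and lives in the linked repository. Purely to illustrate the storage model described above (one DynamoDB item per conversation, keyed by a unique chat_id), here is a hedged Java sketch using the AWS SDK v2; the messages attribute and its format are assumptions for the example, not what langchaingo writes.
Java
import java.util.Map;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

public class ChatHistoryDemo {
    public static void main(String[] args) {
        try (DynamoDbClient dynamo = DynamoDbClient.create()) {
            // One item per conversation, keyed by a unique chat_id
            Map<String, AttributeValue> item = Map.of(
                "chat_id", AttributeValue.builder().s("demo-chat-123").build(),
                // Hypothetical attribute holding the conversation turns as a JSON string
                "messages", AttributeValue.builder().s(
                    "[{\"role\":\"human\",\"text\":\"Hello\"}," +
                    "{\"role\":\"ai\",\"text\":\"Hi! How can I help?\"}]").build());

            dynamo.putItem(PutItemRequest.builder()
                    .tableName("langchain_chat_history")
                    .item(item)
                    .build());
        }
    }
}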
If you want to explore how to use Go for Generative AI solutions, you can read up on some of my earlier blogs: Building LangChain applications with Amazon Bedrock and Go - An introduction Serverless Image Generation Application Using Generative AI on AWS Generative AI Apps With Amazon Bedrock: Getting Started for Go Developers Use Amazon Bedrock and LangChain to build an application to chat with web pages Happy building!
Microsoft's Azure provides many services via a single cloud, which lets them offer one solution for multiple corporate infrastructures. Development teams often use Azure because they value the opportunity to run SQL databases in the cloud and complete simple operations via the Azure portal. But you'll need to have a way to back up your data, as it's crucial to ensuring the functionality of the production site and the stability of everyday workflows. So, creating Azure SQL backups can help you and your team avoid data loss emergencies and have the shortest possible downtime while maintaining control over the infrastructure. Another reason to have a current Azure database backup is Microsoft's policy. Microsoft uses the shared responsibility model, which makes the user responsible for data integrity and recovery, while Microsoft only ensures the availability of its services. Microsoft directly recommends using third-party solutions to create database backups. In case you run a local SQL Server, you'll need to prepare for the possibility of hardware failures that may result in data loss and downtime. An SQL database on Azure helps mitigate that risk, although it's still prone to human errors or cloud-specific threats like malware. These and other threats make enabling Azure SQL database backups necessary for any organization using Microsoft's service to manage and process data. In this tutorial, you'll learn about backing up Azure databases and restoring your data on demand with native tools provided by Microsoft, including methods like: Built-in Azure database backup functionality Cloud archiving Secondary database and table management Linked server Stretch Database Why Back Up Your SQL Azure Database? Although I covered this briefly in the intro, there are many reasons to back up your SQL Azure database data. Disaster Recovery Data centers can be damaged or destroyed by planned cyberattacks, random malware infiltration (check out this article to discover more on ransomware protection), and natural disasters like floods or hurricanes, among others. Backups can be used to swiftly recover data and restore operations after such disasters. Data Loss Prevention Data corruption, hardware failure, and accidental or malicious deletion lead to data loss and can threaten an organization. Backup workflows set up to run regularly mean you can quickly recover the data that was lost or corrupted. Compliance and Regulations Compliance requirements and legislative regulations can be strict regardless of your organization's industry. In most cases, regulations require you to maintain security measures and perform regular backups to stay compliant. Testing and Development You can use backups to create Azure database copies for development, troubleshooting, or testing. Thus, you can fix, develop, or improve your organization's workflows without involving the production environment. How To Back Up Your Azure SQL Database Backing up your Azure SQL database can be challenging if you go through the process without preparation. So that's why I wrote this guide – to help you be prepared. Here's what we'll cover in the following sections: Requirements for SQL Azure database backup How to configure database backups in Azure with native tools Cloud archiving Backup verification and data restoration SQL Azure Database Backup Requirements Before backing up your SQL Azure databases, you need to create and configure Azure storage. 
To do that, you'll need to go through the following steps: First, open the Azure management portal and find Create a Resource. Then, go to Storage > Storage account. Provide the information, including the location and names of a storage account and resource group, according to your preferences. After you enter the information, hit Next. Then, go to the advanced section for additional settings. The optimal choice is to set "Secure transfer required" as Enabled and "Allow access" from All networks. For more resilience in case of human error, you can set "Blob soft delete" as Enabled. With that setting, you can quickly correct accidental deletions in the storage account. After that, specify the tags you need to simplify navigating through your infrastructure. Check the settings once more. If everything is configured correctly, hit Create. Your new storage account is now created. Once the storage account is created, it's time to configure a backup data storage container. Go to the storage account, find Containers, then hit the + Container tab there. After that, specify a name for the new container and switch the Public access level to Private (no anonymous access). You can then use the container as backup storage (.bak files will be stored there in that case). Azure Database Backup Configuration Now, everything is set up for you to back up your SQL Azure database. Do the following to create a database backup: First, open SQL Server Management Studio and establish a connection with the SQL server. After that, right-click the database that should be backed up. The context menu appears, so go to Tasks there. Then hit Back Up…. Then, on the Destination tab, set the "Back up to" option to URL. After that, hit New container. You'll see a message asking you to sign in to your Azure subscription. Sign in to Azure, provide your credentials, pick the container you created before, and hit OK. Now, you'll see the configured backup destination URL listed. To start the workflow to back up your Azure data, hit OK once again. When your SQL Azure database backup is completed, a message shows up: "The backup of database 'your database name' completed successfully." The backup file in the target container should now be visible from the Azure portal. Keep in mind that when uploading backups to any cloud storage, you may face issues if your network connection is not fast enough. In case that's true for you, you can reorganize your backup workflows: send backup data to a physical storage drive first and then send another copy to the cloud. Thus, you can prevent operational challenges that might appear due to limited network bandwidth. Cloud Archiving for Azure Database Backups Databases tend to grow in volume as the organization grows. This means that the storage space required to fit the data and its backups increases significantly. Also, the original data volume prolongs the duration of full backup workflows, posing another challenge. Of course, the first way to get more storage space is to revise your data regularly and erase records that are irrelevant, outdated, or otherwise unnecessary. Still, it's sometimes difficult to determine whether data will become unnecessary or irrelevant, especially when dealing with compliance issues. 
To keep your organization compliant in any case, data archiving can help you solve two problems at once: you can ensure data accessibility on one hand and save storage space on the other. To archive your SQL database in the cloud, you should first save that database copy to an Azure blob container. Then, to move a newly created blob to the archive tier in the Azure portal, do the following: Go to the required container where the SQL database is stored. Choose the blob that you need to move. Hit Change tier. In the Access tier dropdown menu, choose Archive. Hit Save. Additionally, the Archive storage tier is the most affordable one in Azure, meaning that you can reduce your database data TCO with it. Secondary Database and Table Management There are several workflows that can help you set up Azure database backup archiving for your organization. When you need the data to stay in the initial database, for instance, creating a separate table and moving that data there can be your choice. However, the filegroup of that table should stay apart from the main database and be moved to a separate disk whenever possible. Most probably, you'll want to let users access the data you send to a separate table. To make that happen, you can create a view merging the relevant tables and redirect the requests to that view, not to the original table. Doing things that way, you keep the data accessible while making maintenance faster. SQL Server Linking If you can't move the data to another database for internal reasons, such as special Azure backup policies, you can consider maintaining your primary database accordingly. Here, the outcome is likely to be the same as in the previous case, but you need to link the SQL servers or configure apps so they can send direct requests to your second server. The downside here is that your SQL database, which was supposed to be a backup one, becomes a production database and gains corresponding importance for the organization. There are two ways to create linked servers via SQL Server Management Studio (SSMS): the sp_addlinkedserver (Transact-SQL) system stored procedure, which creates a linked server, and the SSMS GUI. After you've ensured that you have appropriate access rights on both server instances you need to link, that the network is configured so you can reach them, and that SSMS is installed, go through the following steps: First, open SSMS and connect to the instance where you need to establish a linked server. Then, find Object Explorer > Server Objects, and right-click Linked Servers. Pick New Linked Server from the dropdown. Then configure the server properties, including name, server type, provider, and product name. Finally, complete the security configuration, set up the server options, and test the connection. Original Data Deletion When you don't need 24/7 data availability but need the data stored due to internal policies or compliance requirements, you can choose what's probably the simplest solution to increase storage space efficiency. Just back up the data that can stay unavailable and then delete the originals from the main database. Accessing any records you may need will still be possible via the backup. Stretch Database Aiming to make the management of organizations' databases simpler, Microsoft implemented the Stretch Database feature in SQL Server 2016. 
With this feature, you can get an SQL backup to Azure by sending the data from the hosted database to an Azure SQL database. The method enables you to increase overall infrastructure cost-efficiency by simplifying backup workflows. To enable this workflow in your environment, develop a policy specifying which data on a hosted server to send to Azure. You don't need to introduce any changes in applications that use the production database: SQL Server can independently get the records from the Azure SQL database. Azure Database Backup Verification and Restoration During an SQL Azure database backup, you can choose to create such backups WITH CHECKSUM or without it. When the workflow is complete, I recommend you use the following command: RESTORE VERIFYONLY. This command enables you to check the recoverability of backup files. To access the data, you can restore records from a backup to a different database. With Azure Automation scripts on backups, you can accelerate the restoration process, thus minimizing downtime and increasing the overall resilience of your Azure infrastructure. You need to follow only a few steps to restore an Azure SQL database to a required recovery point from a backup. Still, keep in mind that your subscription can define the available retention period, which can vary from 7 to 35 days. A native tool for backup restoration to SQL servers is SQL Server Management Studio. To Conclude The critical nature of Azure SQL database data makes Azure SQL backups obligatory for any organization that uses this Microsoft solution. In this guide, we reviewed the process of creating SQL Azure database backups using native Microsoft tools. These tools provide data backup, backup verification, and recovery functionality, along with some automation.
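To tie the backup-and-verify advice above to code: assuming a SQL Server instance that supports BACKUP TO URL (for example, SQL Server running on an Azure VM) and a credential for the target blob container already in place, a hedged JDBC sketch of the back-up-then-verify sequence could look like the following. The server, database, container URL, and credentials are placeholders, and the Microsoft JDBC driver (mssql-jdbc) is assumed to be on the classpath.
Java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BackupAndVerify {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string, database, and container URL
        String jdbcUrl = "jdbc:sqlserver://myserver.example.com:1433;encrypt=true;trustServerCertificate=true";
        String backupUrl = "https://mystorageaccount.blob.core.windows.net/backups/mydb.bak";

        try (Connection conn = DriverManager.getConnection(jdbcUrl, "sqladmin", "password");
             Statement st = conn.createStatement()) {

            // Prerequisite (run once): a credential matching the container URL, e.g.
            // CREATE CREDENTIAL [https://mystorageaccount.blob.core.windows.net/backups]
            //   WITH IDENTITY = 'SHARED ACCESS SIGNATURE', SECRET = '<sas-token>';

            // Back up with checksums so page-level corruption becomes detectable
            st.execute("BACKUP DATABASE [mydb] TO URL = '" + backupUrl + "' WITH CHECKSUM");

            // Verify that the backup is readable and internally consistent
            st.execute("RESTORE VERIFYONLY FROM URL = '" + backupUrl + "' WITH CHECKSUM");

            System.out.println("Backup written and verified: " + backupUrl);
        }
    }
}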
Here, I am going to show the power of SQL Loader + Unix scripting, where multiple data files can be loaded by SQL Loader through automated shell scripts. This is useful when dealing with large chunks of data that need to be moved from one system to another, and it is well suited for a migration project where a large amount of historical data is involved, because in that case it is not practical to run SQL Loader manually for each file and wait until it is loaded. The best option is to keep a Unix program containing the SQL Loader command running all the time; once any file is available in the folder location, it will pick up the files from that folder and start processing them immediately. The Setup I built the sample program on a MacBook. Installing Oracle there differs from installing it on a Windows machine; please go through the video that contains the detailed steps for installing Oracle on a MacBook. Get SQL Developer with Java 8 compliance. Now let us demonstrate the example. Loading Multiple Data Files Into an Oracle DB Table Because it is a MacBook, I have to do all the work inside the Oracle virtual machine. Let's see the diagram below of how SQL Loader works. Use Case We need to load millions of students' records into the student table using shell scripts + SQL Loader automation. The script will run all the time on the Unix server and poll for .dat files; once a .dat file is in place, it will process it. Also, in case there is any bad data, it needs to be identified separately. This type of example is useful in a migration project where you need to load millions of historical records. From the old system, a live feed (a .dat file) will be generated periodically and sent to the new system's server. Once the file is available on the new system's server, it will be loaded into the database by the automated Unix script. Now let's run the script. The script can run all the time on a Unix server. To achieve this, the whole code is put into the block below: Plain Text while true; do [some logic] done The Process 1. I have copied all the files + folder structure into the folder below. /home/oracle/Desktop/example-SQLdr 2. Refer to the folder listing below (ls -lrth): Shell -rwxr-xr-x. 1 oracle oinstall 147 Jul 23 2022 student.ctl -rwxr-xr-x. 1 oracle oinstall 53 Jul 23 2022 student_2.dat -rwxr-xr-x. 1 oracle oinstall 278 Dec 9 12:42 student_1.dat drwxr-xr-x. 2 oracle oinstall 48 Dec 24 09:46 BAD -rwxr-xr-x. 1 oracle oinstall 1.1K Dec 24 10:10 TestSqlLoader.sh drwxr-xr-x. 2 oracle oinstall 27 Dec 24 11:33 DISCARD -rw-------. 1 oracle oinstall 3.5K Dec 24 11:33 nohup.out drwxr-xr-x. 2 oracle oinstall 4.0K Dec 24 11:33 TASKLOG -rwxr-xr-x. 1 oracle oinstall 0 Dec 24 12:25 all_data_file_list.unx drwxr-xr-x. 2 oracle oinstall 6 Dec 24 12:29 ARCHIVE 3. As shown below, there is no data in the student table. 4. Now run the script with nohup: nohup ./TestSqlLoader.sh &. This way, it will keep running on the Unix server. 5. The script will now run and load the two .dat files through SQL Loader. 6. The table should be loaded with the content of the two files. 7. Now I delete the table data again. Just to prove the script is running all the time on the server, I will place the two .dat files from ARCHIVE back into the current directory. 8. Again, place the two data files in the current directory. Shell -rwxr-xr-x. 1 oracle oinstall 147 Jul 23 2022 student.ctl -rwxr-xr-x. 1 oracle oinstall 53 Jul 23 2022 student_2.dat -rwxr-xr-x. 1 oracle oinstall 278 Dec 9 12:42 student_1.dat drwxr-xr-x. 
2 oracle oinstall 48 Dec 24 09:46 BAD -rwxr-xr-x. 1 oracle oinstall 1.1K Dec 24 10:10 TestSqlLoader.sh drwxr-xr-x. 2 oracle oinstall 27 Dec 24 12:53 DISCARD -rw-------. 1 oracle oinstall 4.3K Dec 24 12:53 nohup.out drwxr-xr-x. 2 oracle oinstall 4.0K Dec 24 12:53 TASKLOG -rwxr-xr-x. 1 oracle oinstall 0 Dec 24 13:02 all_data_file_list.unx drwxr-xr-x. 2 oracle oinstall 6 Dec 24 13:03 ARCHIVE 9. The student table has again been loaded with all the data. 10. The script is running all the time on the server: Shell [oracle@localhost example-sqldr]$ ps -ef|grep Test oracle 30203 1 0 12:53? 00:00:00 /bin/bash ./TestSqlLoader.sh oracle 31284 31227 0 13:06 pts/1 00:00:00 grep --color=auto Test Full Source Code for Reference Shell #!/bin/bash bad_ext='.bad' dis_ext='.dis' data_ext='.dat' log_ext='.log' log_folder='TASKLOG' arch_loc="ARCHIVE" bad_loc="BAD" discard_loc="DISCARD" now=$(date +"%Y.%m.%d-%H.%M.%S") log_file_name="$log_folder/TestSQLLoader_$now$log_ext" while true; do ls -a *.dat 2>/dev/null > all_data_file_list.unx for i in `cat all_data_file_list.unx` do #echo "The data file name is :-- $i" data_file_name=`basename $i .dat` echo "Before executing the sql loader command ||Starting of the script" > $log_file_name sqlldr userid=hr/oracle@orcl control=student.ctl errors=15000 log=$i$log_ext bindsize=512000000 readsize=500000 DATA=$data_file_name$data_ext BAD=$data_file_name$bad_ext DISCARD=$data_file_name$dis_ext mv $data_file_name$data_ext $arch_loc 2>/dev/null mv $data_file_name$bad_ext $bad_loc 2>/dev/null mv $data_file_name$dis_ext $discard_loc 2>/dev/null mv $data_file_name$data_ext$log_ext $log_folder 2>/dev/null echo "After Executing the sql loader command||File moved successfully" >> $log_file_name done ## halt the processing for 1 minute sleep 1m done The CTL file is below. SQL OPTIONS (SKIP=1) LOAD DATA APPEND INTO TABLE student FIELDS TERMINATED BY '|' OPTIONALLY ENCLOSED BY '"' TRAILING NULLCOLS ( id, name, dept_id ) The SQL Loader Specification control - Name of the .ctl file errors=15000 - Maximum number of errors SQL Loader can allow log=$i$log_ext - Name of the log file bindsize=512000000 - Max size of the bind array readsize=500000 - Max size of the read buffer DATA=$data_file_name$data_ext - Name and location of the data file BAD=$data_file_name$bad_ext - Name and location of the bad file DISCARD=$data_file_name$dis_ext - Name and location of the discard file In the way stated above, millions of records can be loaded in an automated fashion through SQL Loader + Unix scripting, and the above parameters can be set according to your needs. Please let me know if you liked this article.
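One thing the article does not show is the student table itself. Based on the control file (pipe-delimited fields id, name, and dept_id, with the header row skipped), a plausible definition might look like the sketch below, created over JDBC with the Oracle driver purely for illustration; the column types and sizes are assumptions, and the connection details mirror the hr/oracle@orcl credentials used in the sqlldr command.
Java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateStudentTable {
    public static void main(String[] args) throws Exception {
        // Matches the hr/oracle@orcl credentials used by the sqlldr command above
        String url = "jdbc:oracle:thin:@//localhost:1521/orcl";
        try (Connection conn = DriverManager.getConnection(url, "hr", "oracle");
             Statement st = conn.createStatement()) {
            // Three columns, matching the fields listed in student.ctl
            st.execute(
                "CREATE TABLE student (" +
                "  id NUMBER PRIMARY KEY," +
                "  name VARCHAR2(100)," +
                "  dept_id NUMBER" +
                ")");
            System.out.println("student table created");
        }
    }
}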
Abhishek Gupta
Principal Developer Advocate,
AWS
Artem Ervits
Principal Sales Engineer,
Snowflake
Otavio Santana
Award-winning Software Engineer and Architect,
OS Expert
Denis Magda
Head of DevRel,
Yugabyte