MongoDB is one of the most reliable and robust document-oriented NoSQL databases. It allows developers to provide feature-rich applications and services with various modern built-in functionalities, like machine learning, streaming, full-text search, etc. While not a classical relational database, MongoDB is nevertheless used by a wide range of business sectors, and its use cases cover all kinds of architecture scenarios and data types. Document-oriented databases are inherently different from traditional relational ones, where data are stored in tables and a single entity might be spread across several such tables. In contrast, document databases store data in separate, unrelated collections, which eliminates the intrinsic heaviness of the relational model. However, given that real-world domain models are never so simplistic as to consist of unrelated, separate entities, document databases (including MongoDB) provide several ways to define multi-collection connections similar to classical database relationships, but much lighter, more economical, and more efficient. Quarkus, the "supersonic and subatomic" Java stack, is the new kid on the block that the most trendy and influential developers are desperately grabbing and fighting over. Its modern cloud-native facilities, its design (compliant with best-of-breed standard libraries), and its ability to build native executables have seduced Java developers, architects, engineers, and software designers for a couple of years. We cannot go into further detail here on either MongoDB or Quarkus: the reader interested in learning more is invited to check the documentation on the official MongoDB website or Quarkus website. What we are trying to achieve here is to implement a relatively complex use case consisting of CRUDing a customer-order-product domain model using Quarkus and its MongoDB extension. In an attempt to provide a real-world inspired solution, we're trying to avoid simplistic and caricatural examples based on a zero-connections single-entity model (there are dozens nowadays). So, here we go! The Domain Model The diagram below shows our customer-order-product domain model: As you can see, the central document of the model is Order, stored in a dedicated collection named Orders. An Order is an aggregate of OrderItem documents, each of which points to its associated Product. An Order document also references the Customer who placed it. In Java, this is implemented as follows: Java @MongoEntity(database = "mdb", collection="Customers") public class Customer { @BsonId private Long id; private String firstName, lastName; private InternetAddress email; private Set<Address> addresses; ... } The code above shows a fragment of the Customer class. This is a POJO (Plain Old Java Object) annotated with the @MongoEntity annotation, whose parameters define the database name and the collection name. The @BsonId annotation is used in order to configure the document's unique identifier. While the most common use case is to implement the document's identifier as an instance of the ObjectID class, this would introduce an unnecessary tight coupling between the MongoDB-specific classes and our document. The other properties are the customer's first and last name, the email address, and a set of postal addresses. Let's look now at the Order document. 
Java @MongoEntity(database = "mdb", collection="Orders") public class Order { @BsonId private Long id; private DBRef customer; private Address shippingAddress; private Address billingAddress; private Set<DBRef> orderItemSet = new HashSet<>() ... } Here we need to create an association between an order and the customer who placed it. We could have embedded the associated Customer document in our Order document, but this would have been a poor design because it would have redundantly defined the same object twice. We need to use a reference to the associated Customer document, and we do this using the DBRef class. The same thing happens for the set of the associated order items where, instead of embedding the documents, we use a set of references. The rest of our domain model is quite similar and based on the same normalization ideas; for example, the OrderItem document: Java @MongoEntity(database = "mdb", collection="OrderItems") public class OrderItem { @BsonId private Long id; private DBRef product; private BigDecimal price; private int amount; ... } We need to associate the product which makes the object of the current order item. Last but not least, we have the Product document: Java @MongoEntity(database = "mdb", collection="Products") public class Product { @BsonId private Long id; private String name, description; private BigDecimal price; private Map<String, String> attributes = new HashMap<>(); ... } That's pretty much all as far as our domain model is concerned. There are, however, some additional packages that we need to look at: serializers and codecs. In order to be able to be exchanged on the wire, all our objects, be they business or purely technical ones, have to be serialized and deserialized. These operations are the responsibility of specially designated components called serializers/deserializers. As we have seen, we're using the DBRef type in order to define the association between different collections. Like any other object, a DBRef instance should be able to be serialized/deserialized. The MongoDB driver provides serializers/deserializers for the majority of the data types supposed to be used in the most common cases. However, for some reason, it doesn't provide serializers/deserializers for the DBRef type. Hence, we need to implement our own one, and this is what the serializers package does. Let's look at these classes: Java public class DBRefSerializer extends StdSerializer<DBRef> { public DBRefSerializer() { this(null); } protected DBRefSerializer(Class<DBRef> dbrefClass) { super(dbrefClass); } @Override public void serialize(DBRef dbRef, JsonGenerator jsonGenerator, SerializerProvider serializerProvider) throws IOException { if (dbRef != null) { jsonGenerator.writeStartObject(); jsonGenerator.writeStringField("id", (String)dbRef.getId()); jsonGenerator.writeStringField("collectionName", dbRef.getCollectionName()); jsonGenerator.writeStringField("databaseName", dbRef.getDatabaseName()); jsonGenerator.writeEndObject(); } } } This is our DBRef serializer and, as you can see, it's a Jackson serializer. This is because the quarkus-mongodb-panache extension that we're using here relies on Jackson. Perhaps, in a future release, JSON-B will be used but, for now, we're stuck with Jackson. It extends the StdSerializer class as usual and serializes its associated DBRef object by using the JSON generator, passed as an input argument, to write on the output stream the DBRef components; i.e., the object ID, the collection name, and the database name. 
For more information concerning the DBRef structure, please see the MongoDB documentation. The deserializer performs the complementary operation, as shown below: Java public class DBRefDeserializer extends StdDeserializer<DBRef> { public DBRefDeserializer() { this(null); } public DBRefDeserializer(Class<DBRef> dbrefClass) { super(dbrefClass); } @Override public DBRef deserialize(JsonParser jsonParser, DeserializationContext deserializationContext) throws IOException, JacksonException { JsonNode node = jsonParser.getCodec().readTree(jsonParser); return new DBRef(node.findValue("databaseName").asText(), node.findValue("collectionName").asText(), node.findValue("id").asText()); } } This is pretty much all that may be said as far as the serializers/deserializers are concerned. Let's move on to see what the codecs package brings us. Java objects are stored in a MongoDB database using the BSON (Binary JSON) format. In order to store information, the MongoDB driver needs the ability to map Java objects to their associated BSON representation. It does that by means of the Codec interface, which declares the abstract methods required to map Java objects to BSON and the other way around. By implementing this interface, one can define the conversion logic between Java and BSON, and conversely. The MongoDB driver includes the required Codec implementation for the most common types but again, for some reason, when it comes to DBRef, this implementation is only a dummy one, which raises UnsupportedOperationException. Having contacted the MongoDB driver implementers, I didn't succeed in finding any other solution than implementing my own Codec mapper, as shown by the class DocstoreDBRefCodec. For brevity reasons, we won't reproduce this class' source code here. Once our dedicated Codec is implemented, we need to register it with the MongoDB driver, such that it uses it when it comes to mapping DBRef types to Java objects and conversely. In order to do that, we need to implement the CodecProvider interface which, as shown by the class DocstoreDBRefCodecProvider, returns via its get() method the concrete class responsible for performing the mapping; i.e., in our case, DocstoreDBRefCodec. And that's all we need to do here, as Quarkus will automatically discover and use our customized CodecProvider implementation. Please have a look at these classes to see and understand how things are done. The Data Repositories Quarkus Panache greatly simplifies the data persistence process by supporting both the active record and the repository design patterns. Here, we'll be using the second one. As opposed to similar persistence stacks, Panache relies on compile-time bytecode enhancements of the entities. It includes an annotation processor that automatically performs these enhancements. All that this annotation processor needs in order to perform its enhancement job is a class like the one below: Java @ApplicationScoped public class CustomerRepository implements PanacheMongoRepositoryBase<Customer, Long>{} The code above is all that you need in order to define a complete service able to persist Customer document instances. Your class needs to implement the PanacheMongoRepositoryBase interface and parameterize it with your entity and ID types - here, Customer and Long. 
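To give an idea of what this buys you, here is a sketch of how such a repository might be used from application code. The CustomerService class below is hypothetical (the article does not include it), but the methods it calls - persist, findByIdOptional, listAll, deleteById - are part of the PanacheMongoRepositoryBase API; the jakarta namespace is assumed.

```java
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import java.util.List;
import java.util.Optional;

// Hypothetical service showing typical calls against the generated repository.
@ApplicationScoped
public class CustomerService {

    @Inject
    CustomerRepository customerRepository;

    public void register(Customer customer) {
        // persist() is inherited from PanacheMongoRepositoryBase
        customerRepository.persist(customer);
    }

    public Optional<Customer> find(Long id) {
        return customerRepository.findByIdOptional(id);
    }

    public List<Customer> all() {
        return customerRepository.listAll();
    }

    public void remove(Long id) {
        customerRepository.deleteById(id);
    }
}
```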
The Panache annotation processor will generate all the methods required to perform the most common CRUD operations, including but not limited to saving, updating, deleting, querying, paging, sorting, transaction handling, etc. All these details are fully explained here. Another possibility is to extend PanacheMongoRepository instead of PanacheMongoRepositoryBase and to use the provided ObjectID keys instead of customizing them as Long, as we did in our example. Whichever alternative you choose is simply a matter of preference. The REST API In order for our Panache-generated persistence service to become effective, we need to expose it through a REST API. In the most common case, we would have to manually craft this API, together with its implementation, consisting of the full set of required REST endpoints. This tedious and repetitive operation can be avoided by using the quarkus-mongodb-rest-data-panache extension, whose annotation processor is able to automatically generate the required REST endpoints out of interfaces having the following pattern: Java public interface CustomerResource extends PanacheMongoRepositoryResource<CustomerRepository, Customer, Long> {} Believe it or not: this is all you need to generate a full REST API implementation with all the endpoints required to invoke the persistence service generated previously by the mongodb-panache extension annotation processor. Now we are ready to build our REST API as a Quarkus microservice. We chose to build this microservice as a Docker image, by means of the quarkus-container-image-jib extension, by simply including the following Maven dependency: XML <dependency> <groupId>io.quarkus</groupId> <artifactId>quarkus-container-image-jib</artifactId> </dependency> With this dependency in place, the quarkus-maven-plugin will create a local Docker image to run our microservice. The parameters of this Docker image are defined in the application.properties file, as follows: Properties files quarkus.container-image.build=true quarkus.container-image.group=quarkus-nosql-tests quarkus.container-image.name=docstore-mongodb quarkus.mongodb.connection-string = mongodb://admin:admin@mongo:27017 quarkus.mongodb.database = mdb quarkus.swagger-ui.always-include=true quarkus.jib.jvm-entrypoint=/opt/jboss/container/java/run/run-java.sh Here we define the name of the newly created Docker image as quarkus-nosql-tests/docstore-mongodb. This is the concatenation of the parameters quarkus.container-image.group and quarkus.container-image.name, separated by a "/". The property quarkus.container-image.build having the value true instructs the Quarkus plugin to bind the image build to the package phase of Maven. This way, simply by executing a mvn package command, we generate a Docker image able to run our microservice. This may be verified by running the docker images command. The property named quarkus.jib.jvm-entrypoint defines the command to be run by the newly generated Docker image. quarkus-run.jar is the standard Quarkus microservice startup file used when the base image is ubi8/openjdk-17-runtime, as in our case. Other properties are quarkus.mongodb.connection-string and quarkus.mongodb.database, which define the MongoDB connection string and the name of the database. Last but not least, the property quarkus.swagger-ui.always-include includes the Swagger UI in our microservice space so that we can test it easily. Let's see now how to run and test the whole thing. 
Running and Testing Our Microservices Now that we looked at the details of our implementation, let's see how to run and test it. We chose to do it on behalf of the docker-compose utility. Here is the associated docker-compose.yml file: YAML version: "3.7" services: mongo: image: mongo environment: MONGO_INITDB_ROOT_USERNAME: admin MONGO_INITDB_ROOT_PASSWORD: admin MONGO_INITDB_DATABASE: mdb hostname: mongo container_name: mongo ports: - "27017:27017" volumes: - ./mongo-init/:/docker-entrypoint-initdb.d/:ro mongo-express: image: mongo-express depends_on: - mongo hostname: mongo-express container_name: mongo-express links: - mongo:mongo ports: - 8081:8081 environment: ME_CONFIG_MONGODB_ADMINUSERNAME: admin ME_CONFIG_MONGODB_ADMINPASSWORD: admin ME_CONFIG_MONGODB_URL: mongodb://admin:admin@mongo:27017/ docstore: image: quarkus-nosql-tests/docstore-mongodb:1.0-SNAPSHOT depends_on: - mongo - mongo-express hostname: docstore container_name: docstore links: - mongo:mongo - mongo-express:mongo-express ports: - "8080:8080" - "5005:5005" environment: JAVA_DEBUG: "true" JAVA_APP_DIR: /home/jboss JAVA_APP_JAR: quarkus-run.jar This file instructs the docker-compose utility to run three services: A service named mongo running the Mongo DB 7 database A service named mongo-express running the MongoDB administrative UI A service named docstore running our Quarkus microservice We should note that the mongo service uses an initialization script mounted on the docker-entrypoint-initdb.d directory of the container. This initialization script creates the MongoDB database named mdb such that it could be used by the microservices. JavaScript db = db.getSiblingDB(process.env.MONGO_INITDB_ROOT_USERNAME); db.auth( process.env.MONGO_INITDB_ROOT_USERNAME, process.env.MONGO_INITDB_ROOT_PASSWORD, ); db = db.getSiblingDB(process.env.MONGO_INITDB_DATABASE); db.createUser( { user: "nicolas", pwd: "password1", roles: [ { role: "dbOwner", db: "mdb" }] }); db.createCollection("Customers"); db.createCollection("Products"); db.createCollection("Orders"); db.createCollection("OrderItems"); This is an initialization JavaScript that creates a user named nicolas and a new database named mdb. The user has administrative privileges on the database. Four new collections, respectively named Customers, Products, Orders and OrderItems, are created as well. In order to test the microservices, proceed as follows: Clone the associated GitHub repository: $ git clone https://github.com/nicolasduminil/docstore.git Go to the project: $ cd docstore Build the project: $ mvn clean install Check that all the required Docker containers are running: $ docker ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 7882102d404d quarkus-nosql-tests/docstore-mongodb:1.0-SNAPSHOT "/opt/jboss/containe…" 8 seconds ago Up 6 seconds 0.0.0.0:5005->5005/tcp, :::5005->5005/tcp, 0.0.0.0:8080->8080/tcp, :::8080->8080/tcp, 8443/tcp docstore 786fa4fd39d6 mongo-express "/sbin/tini -- /dock…" 8 seconds ago Up 7 seconds 0.0.0.0:8081->8081/tcp, :::8081->8081/tcp mongo-express 2e850e3233dd mongo "docker-entrypoint.s…" 9 seconds ago Up 7 seconds 0.0.0.0:27017->27017/tcp, :::27017->27017/tcp mongo Run the integration tests: $ mvn -DskipTests=false failsafe:integration-test This last command will run all the integration tests which should succeed. These integration tests are implemented using the RESTassured library. 
The listing below shows one of these integration tests located in the docstore-domain project: Java @QuarkusIntegrationTest @TestMethodOrder(MethodOrderer.OrderAnnotation.class) public class CustomerResourceIT { private static Customer customer; @BeforeAll public static void beforeAll() throws AddressException { customer = new Customer("John", "Doe", new InternetAddress("john.doe@gmail.com")); customer.addAddress(new Address("Gebhard-Gerber-Allee 8", "Kornwestheim", "Germany")); customer.setId(10L); } @Test @Order(10) public void testCreateCustomerShouldSucceed() { given() .header("Content-type", "application/json") .and().body(customer) .when().post("/customer") .then() .statusCode(HttpStatus.SC_CREATED); } @Test @Order(20) public void testGetCustomerShouldSucceed() { assertThat (given() .header("Content-type", "application/json") .when().get("/customer") .then() .statusCode(HttpStatus.SC_OK) .extract().body().jsonPath().getString("firstName[0]")).isEqualTo("John"); } @Test @Order(30) public void testUpdateCustomerShouldSucceed() { customer.setFirstName("Jane"); given() .header("Content-type", "application/json") .and().body(customer) .when().pathParam("id", customer.getId()).put("/customer/{id}") .then() .statusCode(HttpStatus.SC_NO_CONTENT); } @Test @Order(40) public void testGetSingleCustomerShouldSucceed() { assertThat (given() .header("Content-type", "application/json") .when().pathParam("id", customer.getId()).get("/customer/{id}") .then() .statusCode(HttpStatus.SC_OK) .extract().body().jsonPath().getString("firstName")).isEqualTo("Jane"); } @Test @Order(50) public void testDeleteCustomerShouldSucceed() { given() .header("Content-type", "application/json") .when().pathParam("id", customer.getId()).delete("/customer/{id}") .then() .statusCode(HttpStatus.SC_NO_CONTENT); } @Test @Order(60) public void testGetSingleCustomerShouldFail() { given() .header("Content-type", "application/json") .when().pathParam("id", customer.getId()).get("/customer/{id}") .then() .statusCode(HttpStatus.SC_NOT_FOUND); } } You can also use the Swagger UI interface for testing purposes by firing your preferred browser at http://localhost:8080/q:swagger-ui. Then, in order to test endpoints, you can use the payloads in the JSON files located in the src/resources/data directory of the docstore-api project. You also can use the MongoDB UI administrative interface by going to http://localhost:8081 and authenticating yourself with the default credentials (admin/pass). You can find the project source code in my GitHub repository. Enjoy!
DevOps proposes Continuous Integration and Continuous Delivery (CI/CD) solutions for software project management. In CI/CD, the process of software development and operations falls into a cyclical feedback loop, which promotes not only innovation and improvements but also makes the product quickly adapt to the changing needs of the market. So, it becomes easy to cater to the needs of the customer and garner their satisfaction. The development team that adopts the culture becomes agile, flexible, and adaptive (called the Agile development team) in building an incremental quality product that focuses on continuous improvement and innovation. One of the key areas of CI/CD is to address changes. The evolution of software also has its effect on the database as well. Database change management primarily focuses on this aspect and can be a real hurdle in collaboration with DevOps practices which is advocating automation for CI/CD pipelines. Automating database change management enables the development team to stay agile by keeping database schema up to date as part of the delivery and deployment process. It helps to keep track of changes critical for debugging production problems. The purpose of this article is to highlight how database change management is an important part of implementing Continuous Delivery and recommends some processes that help streamline application code and database changes into a single delivery pipeline. Continuous Integration One of the core principles of an Agile development process is Continuous Integration. Continuous Integration emphasizes making sure that code developed by multiple members of the team is always integrated. It avoids the “integration hell” that used to be so common during the days when developers worked in their silos and waited until everyone was done with their pieces of work before attempting to integrate them. Continuous Integration involves independent build machines, automated builds, and automated tests. It promotes test-driven development and the practice of making frequent atomic commits to the baseline or master branch or trunk of the version control system. Figure 1: A typical Continuous Integration process The diagram above illustrates a typical Continuous Integration process. As soon as a developer checks in code to the source control system, it will trigger a build job configured in the Continuous Integration (CI) server. The CI Job will check out code from the version control system, execute a build, run a suite of tests, and deploy the generated artifacts (e.g., a JAR file) to an artifact repository. There may be timed CI jobs to deploy the code to the development environment, push details out to a static analysis tool, run system tests on the deployed code, or any automated process that the team feels is useful to ensure that the health of the codebase is always maintained. It is the responsibility of the Agile team to make sure that if there is any failure in any of the above-mentioned automated processes, it is promptly addressed and no further commits are made to the codebase until the automated build is fixed. Continuous Delivery Continuous Delivery takes the concept of Continuous Integration a couple of steps further. In addition to making sure that different modules of a software system are always integrated, it also makes sure that the code is always deployable (to production). 
This means that in addition to having an automated build and a completely automated test suite, there should be an automated process of delivering the software to production. Using the automated process, it should be possible to deploy software on short notice, typically within minutes, with the click of a button. Continuous Delivery is one of the core principles of DevOps and offers many benefits including predictable deploys, reduced risk while introducing new features, shorter feedback cycles with the customer, and overall higher quality of software. Figure 2: A typical Continuous Delivery process The above diagram shows a typical Continuous Delivery process. Please note that the above-illustrated Continuous Delivery process assumes that a Continuous Integration process is already in place. The above diagram shows 2 environments: e.g., User Acceptance Test (UAT) and production. However, different organizations may have multiple staging environments (Quality Assurance or QA, load testing, pre-production, etc.) before the software makes it to the production environment. However, it is the same codebase, and more precisely, the same version of the codebase that gets deployed to different environments. Deployment to all staging environments and the production environment are performed through the same automated process. There are many tools available to manage configurations (as code) and make sure that deploys are automatic (usually self-service), controlled, repeatable, reliable, auditable, and reversible (can be rolled back). It is beyond the scope of this article to go over those DevOps tools, but the point here is to stress the fact that there must be an automated process to release software to production on demand. Database Change Management Is the Bottleneck Agile practices are pretty much mainstream nowadays when it comes to developing application code. However, we don’t see as much adoption of agile principles and continuous integration in the area of database development. Almost all enterprise applications have a database involved and thus project deliverables would involve some database-related work in addition to application code development. Therefore, slowness in the process of delivering database-related work - for example, a schema change - slows down the delivery of an entire release. In this article, we would assume the database to be a relational database management system. The processes would be very different if the database involved is a non-relational database like a columnar database, document database, or a database storing data in key-value pairs or graphs. Let me illustrate this scenario with a real example: here is this team that practices Agile software development methodologies. They follow a particular type of Agile called Scrum, and they have a 2-week Sprint. One of the stories in the current sprint is the inclusion of a new field in the document that they interchange with a downstream system. The development team estimated that the story is worth only 1 point when it comes to the development of the code. It only involves minor changes in the data access layer to save the additional field and retrieve it later when a business event occurs and causes the application to send out a document to a downstream system. However, it requires the addition of a new column to an existing table. 
Had there been no database changes involved, the story could have been easily completed in the current sprint, but since there is a database change involved, the development team doesn’t think it is doable in this sprint. Why? Because a schema change request needs to be sent to the Database Administrators (DBA). The DBAs will take some time to prioritize this change request and rank this against other change requests that they received from other application development teams. Once the DBAs make the changes in the development database, they will let the developers know and wait for their feedback before they promote the changes to the QA environment and other staging environments, if applicable. Developers will test changes in their code against the new schema. Finally, the development team will closely coordinate with the DBAs while scheduling delivery of application changes and database changes to production. Figure 3: Manual or semi-automated process in delivering database changes Please note in the diagram above that the process is not triggered by a developer checking in code and constitutes a handoff between two teams. Even if the deployment process on the database side is automated, it is not integrated with the delivery pipeline of application code. The changes in the application code are directly dependent on the database changes, and they together constitute a release that delivers a useful feature to the customer. Without one change, the other change is not only useless but could potentially cause regression. However, the lifecycle of both of these changes is completely independent of each other. The fact that the database and codebase changes follow independent life cycles and the fact that there are handoffs and manual checkpoints involved, the Continuous Delivery process, in this example, is broken. Recommendations To Fix CI/CD for Database Changes In the following sections, we will explain how this can be fixed and how database-related work including data modeling and schema changes, etc., can be brought under the ambit of the Continuous Delivery process. DBAs Should Be a Part of the Cross-Functional Agile Team Many organizations have their DBAs split into broadly two different types of roles based on whether they help to build a database for application development teams or maintain production databases. The primary responsibility of a production DBA is to ensure the availability of production databases. They monitor the database, take care of upgrades and patches, allocate storage, perform backup and recovery, etc. A development DBA, on the other hand, works closely with the application development team and helps them come up with data model design, converts a logical data model into a physical database schema, estimates storage requirements, etc. To bring database work and application development work into one single delivery pipeline, it is almost necessary that the development DBA be a part of the development team. Full-stack developers in the development team with good knowledge of the database may also wear the hat of a development DBA. Database as Code It is not feasible to have database changes and application code integrated into a single delivery pipeline unless database changes are treated the same way as application code. This necessitates scripting every change in the database and having them version-controlled. It should then be possible to stand up a new instance of the database automatically from the scripts on demand. 
If we had to capture database objects as code, we would first need to classify them and evaluate each of those types to see if and how they need to be captured as scripts (code). Following is a broad classification: Database Structure: This is basically the definition of how stored data will be structured in the database and is also known as a schema. It includes table definitions, views, constraints, indexes, and types. The data dictionary may also be considered a part of the database structure. Stored Code: These are very similar to application code, except that they are stored in the database and are executed by the database engine. They include stored procedures, functions, packages, triggers, etc. Reference Data: These are usually a set of permissible values that are referenced from other tables that store business data. Ideally, tables representing reference data have very few records and don't change much over the life cycle of an application. They may change when some business process changes but don't usually change during the normal course of business. Application Data or Business Data: These are the data that the application records during the normal course of business. The main purpose of any database system is to store these data. The other three types of database objects exist only to support these data. Out of the above four types of database objects, the first three can be and should be captured as scripts and stored in a version control system.
| Type | Example | Scripted (stored like code)? |
| Database Structure | Schema objects like tables, views, constraints, indexes, etc. | Yes |
| Stored Code | Triggers, procedures, functions, packages, etc. | Yes |
| Reference Data | Codes, lookup tables, static data, etc. | Yes |
| Business/Application Data | Data generated from day-to-day business operations | No |
Table 1: What types of database objects can and cannot be scripted. As shown in the table above, business or application data are the only type that won't be scripted or stored as code. All rollbacks, revisions, archival, etc., are handled by the database itself; however, there is one exception. When a schema change forces data migration - say, for example, populating a new column or moving data from a base table to a normalized table - that migration script should be treated as code and should follow the same life cycle as the schema change. Let's take an example of a very simple data model to illustrate how scripts may be stored as code. This model is so simple and so often used in examples that it may be considered the "Hello, World!" of data modeling. Figure 4: Example model with tables containing business data and ones containing reference data In the model above, a customer may be associated with zero or more addresses, like a billing address, shipping address, etc. The table AddressType stores the different types of addresses, like billing, shipping, residential, work, etc. The data stored in AddressType can be considered reference data as it is not supposed to grow during day-to-day business operations. On the other hand, the other tables contain business data. As the business finds more and more customers, those tables will continue to grow. Example scripts for the tables, the constraints, and the reference data are not reproduced here. We won't get into any more details and cover each type of database object; the purpose of the examples is simply to illustrate that all database objects, except for business data, can be and should be captured in SQL scripts. 
Version Control Database Artifacts in the Same Repository as Application Code Keeping the database artifacts in the same repository of the version control system as the application code offers a lot of advantages. They can be tagged and released together since, in most cases, a change in database schema also involves a change in application code, and they together constitute a release. Having them together also reduces the possibility of application code and the database getting out of sync. Another advantage is just plain readability. It is easier for a new team member to come up to speed if everything related to a project is in a single place. Figure 5: Example structure of a Java Maven project containing database code The above screenshot shows how database scripts can be stored alongside application code. Our example is a Java application, structured as a Maven project. The concept is however agnostic of what technology is used to build the application. Even if it was a Ruby or a .NET application, we would store the database objects as scripts alongside application code to let CI/CD automation tools find them in one place and perform necessary operations on them like building the schema from scratch or generating migration scripts for a production deployment. Integrate Database Artifacts Into the Build Scripts It is important to include database scripts in the build process to ensure that database changes go hand in hand with application code in the same delivery pipeline. Database artifacts are usually SQL scripts of some form and all major build tools support executing SQL scripts either natively or via plugins. We won’t get into any specific build technology but will list down the tasks that the build would need to perform. Here we are talking about builds in local environments or CI servers. We will talk about builds in staging environments and production at a later stage. The typical tasks involved are: Drop Schema Create Schema Create Database Structure (or schema objects): They include tables, constraints, indexes, sequences, and synonyms. Deploy stored code, like procedures, functions, packages, etc. Load reference data Load TEST data If the build tool in question supports build phases, this will typically be in the phase before integration tests. This ensures that the database will be in a stable state with a known set of data loaded. There should be sufficient integration tests that will cause the build to fail if the application code goes out of sync with the data model. This ensures that the database is always integrated with the application code: the first step in achieving a Continuous Delivery model involving database change management. Figure 6: Screenshot of code snippet showing a Maven build for running database scripts The above screenshot illustrates the usage of a Maven plugin to run SQL scripts. It drops the schema, recreates it, and runs all the DDL scripts to create tables, constraints, indexes, sequences, and synonyms. Then it deploys all the stored code into the database and finally loads all reference data and test data. Refactor Data Model as Needed Agile methodology encourages evolutionary design over upfront design; however, many organizations that claim to be Agile shops, actually perform an upfront design when it comes to data modeling. There is a perception that schema changes are difficult to implement later in the game, and thus it is important to get it right the first time. 
If the recommendations made in the previous sections are followed - like having an integrated team with developers and DBAs, scripting database changes, and version-controlling them alongside application code - it won't be difficult to automate all schema changes. Once the deployment and rollback of database changes are fully automated and there is a suite of automated tests in place, it should be easy to mitigate risks in refactoring the schema. Avoid Shared Database Having a database schema shared by more than one application is a bad idea, but such databases still exist. There is even a mention of a "Shared Database" as an integration pattern in a famous book on enterprise integration patterns, Enterprise Integration Patterns by Gregor Hohpe and Bobby Woolf. Any effort to bring application code and database changes under the same delivery pipeline won't work unless the database truly belongs to the application and is not shared by other applications. However, this is not the only reason why a shared database should be avoided. "Shared Database" also causes tight coupling between applications and a multitude of other problems. Dedicated Schema for Every Committer and CI Server Developers should be able to work in their own sandboxes without fear of breaking anything in a common environment like the development database instance; similarly, there should be a dedicated sandbox for the CI server as well. This follows the pattern of how application code is developed. A developer makes changes and runs the build locally, and if the build succeeds and all the tests pass, (s)he commits the changes. The sandbox could be either an independent database instance, typically installed locally on the developer's machine, or it could be a different schema in a shared database instance. Figure 7: Developers make changes in their local environment and commit frequently As shown in the above diagram, each developer has their own copy of the schema. When a full build is performed, in addition to building the application, it also builds the database schema from scratch. It drops the schema, recreates it, and executes DDL scripts to load all schema objects like tables, views, sequences, constraints, and indexes. It creates objects representing stored code, like functions, procedures, packages, and triggers. Finally, it loads all the reference data and test data. Automated tests ensure that the application code and database objects are always in sync. It must be noted that data model changes are less frequent than application code changes, so the build script should have the option to skip the database build for the sake of build performance. The CI build job should also be set up to have its own sandbox of the database. The build script performs a full build that includes building the application as well as building the database schema from scratch. It runs a suite of automated tests to ensure that the application itself and the database that it interacts with are in sync. Figure 8: Revised CI process with integration of database build with build of application code Please note the similarity of the process described in the above diagram with the one described in Figure 1. The build machine or the CI server contains a build job that is triggered by any commit to the repository. The build that it performs includes both the application build and the database build. The database scripts are now always integrated, just like application code. 
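For illustration, the sketch below shows one way the "rebuild the schema from scratch" step could look in plain Java over JDBC. Everything here - connection details, directory layout, file names, and the PostgreSQL-style DROP SCHEMA syntax - is hypothetical; in practice this job is usually delegated to a build plugin or a dedicated migration tool rather than hand-rolled.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;
import java.util.stream.Stream;

// Minimal sketch: drop and recreate a sandbox schema by running the
// version-controlled scripts in lexical order
// (e.g. 01_tables.sql, 02_constraints.sql, 03_reference_data.sql).
public class SchemaBuilder {

    public static void main(String[] args) throws Exception {
        // Hypothetical sandbox connection; in a real build these come from build properties.
        try (Connection connection = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/app_dev", "dev_user", "dev_password");
             Statement statement = connection.createStatement()) {

            statement.execute("DROP SCHEMA IF EXISTS app CASCADE");
            statement.execute("CREATE SCHEMA app");

            try (Stream<Path> scripts = Files.list(Path.of("src/main/resources/db/scripts"))) {
                List<Path> ordered = scripts.filter(p -> p.toString().endsWith(".sql"))
                                            .sorted()
                                            .toList();
                for (Path script : ordered) {
                    // Assumes one statement per file for simplicity; real scripts
                    // usually need a proper SQL runner that splits on delimiters.
                    statement.execute(Files.readString(script));
                }
            }
        }
    }
}
```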
Dealing With Migrations The process described above would build the database schema objects, stored code, reference data, and test data from scratch. This is all good for continuous integration and local environments. This process won’t work for the production database and even QA or UAT environments. The real purpose of any database is storing business data, and every other database object exists only to support business data. Dropping schema and recreating it from scripts is not an option for a database currently running business transactions. In this case, there is a need for scripting deltas, i.e., the changes that will transition the database structure from a known state (a particular release of software) to a desired state. The transition will also include any data migration. Schema changes may lead to a requirement to migrate data as well. For example, as a result of normalization, data from one table may need to be migrated to one or more child tables. In such cases, a script that transforms data from the parent table to the children should also be a part of the migration scripts. Schema changes may be scripted and maintained in the source code repository so that they are part of the build. These scripts may be hand-coded during active development, but there are tools available to automate that process as well. One such tool is Flyway, which can generate migration scripts for the transition of one state of schema to another state. Figure 9: Automation of schema migrations and rollback In the above picture, the left-hand side shows the current state of the database which is in sync with the application release 1.0.1 (the previous release). The right-hand side shows the desired state of the database in the next release. We have the state on the left-hand side captured and tagged in the version control system. The right-hand side is also captured in the version control system as the baseline, master branch, or trunk. The difference between the right-hand side and the left-hand side is what needs to be applied to the database in the staging environments and the production environment. The differences may be manually tracked and scripted, which is laborious and error-prone. The above diagram illustrates that tools like Flyway can automate the creation of such differences in the form of migration scripts. The automated process will create the following: Migration script (to transition the database from the prior release to the new release) Rollback script (to transition the database back to the previous release). The generated scripts will be tagged and stored with other deploy artifacts. This automation process may be integrated with the Continuous Delivery process to ensure repeatable, reliable, and reversible (ability to rollback) database changes. Continuous Delivery With Database Changes Incorporated Into It Let us now put the pieces together. There is a Continuous Integration process already in place that rebuilds the database along with the application code. We have a process in place that generates migration scripts for the database. These generated migration scripts are a part of the deployment artifacts. The DevOps tools will use these released artifacts to build any of the staging environments or the production environment. The deployment artifacts will also contain rollback scripts to support self-service rollback. 
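As a concrete illustration of the tooling mentioned above, Flyway can also be invoked from Java, which makes it easy to hook migration runs into the delivery pipeline. The snippet below is a minimal sketch with hypothetical connection details, assuming Flyway-style versioned scripts under db/migration on the classpath; rollback scripts are not covered here and would be applied by a separate step.

```java
import org.flywaydb.core.Flyway;

// Minimal sketch of applying versioned migrations with Flyway's Java API.
public class MigrateDatabase {

    public static void main(String[] args) {
        Flyway flyway = Flyway.configure()
                // Hypothetical staging database; real values come from deploy configuration.
                .dataSource("jdbc:postgresql://localhost:5432/app_uat", "deploy_user", "deploy_password")
                .locations("classpath:db/migration")
                .load();

        // Applies any pending versioned scripts and records them in the schema history table.
        flyway.migrate();
    }
}
```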
If anything goes wrong, the previous version of the application may then be redeployed and the database rollback script shall be run to transition the database schema to the previous state that is in sync with the previous release of the application code. Figure 10: Continuous Delivery incorporating database changes The above diagram depicts a Continuous Delivery process that has database change management incorporated into it. This assumes that a Continuous Integration process is already there in place. When a UAT (or any other staging environment like TEST, QA, etc.) deployment is initiated, the automated processes take care of creating a tag in the source control repository, building application deployable artifacts from the tagged codebase, generating database migration scripts, assembling the artifacts and deploying. The deployment process includes the deployment of the application as well as applying migration scripts to the database. The same artifacts will be used to deploy the application to the production environment, following the approval process. A rollback would involve redeploying the previous release of the application and running the database rollback script. Tools Available in the Market The previous sections primarily describe how to achieve CI/CD in a project that involves database changes by following some processes but don’t particularly take into consideration any tools that help in achieving them. The above recommendations are independent of any particular tool. A homegrown solution can be developed using common automation tools like Maven or Gradle for build automation, Jenkins or TravisCI for Continuous Integration, and Chef or Puppet for configuration management; however, there are many tools available in the marketplace, that specifically deal with Database DevOps. Those tools may also be taken advantage of. Some examples are: Datical Redgate Liquibase Flyway Conclusion Continuous Integration and Continuous Delivery processes offer tremendous benefits to organizations, like accelerated time to market, reliable releases, and overall higher-quality software. Database change management is traditionally cautious and slow. In many cases, database changes involve manual processes and often cause a bottleneck to the Continuous Delivery process. The processes and best practices mentioned in this article, along with available tools in the market, should hopefully eliminate this bottleneck and help to bring database changes into the same delivery pipeline as application code.
In a previous blog, the influence of the document format and the way it is embedded in combination with semantic search was discussed. LangChain4j was used to accomplish this. The way the document was embedded has a major influence on the results. This was one of the main conclusions. However, a perfect result was not achieved. In this post, you will take a look at Weaviate, a vector database that has a Java client library available. You will investigate whether better results can be achieved. The source documents are two Wikipedia documents. You will use the discography and list of songs recorded by Bruce Springsteen. The interesting part of these documents is that they contain facts and are mainly in a table format. Parts of these documents are converted to Markdown in order to have a better representation. The same documents were used in the previous blog, so it will be interesting to see how the findings from that post compare to the approach used in this post. The sources used in this blog can be found on GitHub. Prerequisites The prerequisites for this blog are: Basic knowledge of embedding and vector stores Basic Java knowledge, Java 21 is used Basic knowledge of Docker The Weaviate starter guides are also interesting reading material. How to Implement Vector Similarity Search 1. Installing Weaviate There are several ways to install Weaviate. An easy installation is through Docker Compose. Just use the sample Docker Compose file. YAML version: '3.4' services: weaviate: command: - --host - 0.0.0.0 - --port - '8080' - --scheme - http image: semitechnologies/weaviate:1.23.2 ports: - 8080:8080 - 50051:50051 volumes: - weaviate_data:/var/lib/weaviate restart: on-failure:0 environment: QUERY_DEFAULTS_LIMIT: 25 AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true' PERSISTENCE_DATA_PATH: '/var/lib/weaviate' DEFAULT_VECTORIZER_MODULE: 'none' ENABLE_MODULES: 'text2vec-cohere,text2vec-huggingface,text2vec-palm,text2vec-openai,generative-openai,generative-cohere,generative-palm,ref2vec-centroid,reranker-cohere,qna-openai' CLUSTER_HOSTNAME: 'node1' volumes: weaviate_data: Start the Compose file from the root of the repository. Shell $ docker compose -f docker/compose-initial.yaml up You can shut it down with CTRL+C or with the following command: Shell $ docker compose -f docker/compose-initial.yaml down 2. Connect to Weaviate First, let’s try to connect to Weaviate through the Java library. Add the following dependency to the pom file: XML <dependency> <groupId>io.weaviate</groupId> <artifactId>client</artifactId> <version>4.5.1</version> </dependency> The following code will create a connection to Weaviate and display some metadata information about the instance. 
Java Config config = new Config("http", "localhost:8080"); WeaviateClient client = new WeaviateClient(config); Result<Meta> meta = client.misc().metaGetter().run(); if (meta.getError() == null) { System.out.printf("meta.hostname: %s\n", meta.getResult().getHostname()); System.out.printf("meta.version: %s\n", meta.getResult().getVersion()); System.out.printf("meta.modules: %s\n", meta.getResult().getModules()); } else { System.out.printf("Error: %s\n", meta.getError().getMessages()); } The output is the following: Shell meta.hostname: http://[::]:8080 meta.version: 1.23.2 meta.modules: {generative-cohere={documentationHref=https://docs.cohere.com/reference/generate, name=Generative Search - Cohere}, generative-openai={documentationHref=https://platform.openai.com/docs/api-reference/completions, name=Generative Search - OpenAI}, generative-palm={documentationHref=https://cloud.google.com/vertex-ai/docs/generative-ai/chat/test-chat-prompts, name=Generative Search - Google PaLM}, qna-openai={documentationHref=https://platform.openai.com/docs/api-reference/completions, name=OpenAI Question & Answering Module}, ref2vec-centroid={}, reranker-cohere={documentationHref=https://txt.cohere.com/rerank/, name=Reranker - Cohere}, text2vec-cohere={documentationHref=https://docs.cohere.ai/embedding-wiki/, name=Cohere Module}, text2vec-huggingface={documentationHref=https://huggingface.co/docs/api-inference/detailed_parameters#feature-extraction-task, name=Hugging Face Module}, text2vec-openai={documentationHref=https://platform.openai.com/docs/guides/embeddings/what-are-embeddings, name=OpenAI Module}, text2vec-palm={documentationHref=https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings, name=Google PaLM Module} The version is shown and the modules that were activated, this corresponds to the modules activated in the docker compose file. 3. Embed Documents In order to query the documents, the documents need to be embedded first. This can be done by means of the text2vec-transformers module. Create a new Docker Compose file with only the text2vec-transformers module enabled. You also set this module as DEFAULT_VECTORIZER_MODULE, set the TRANSFORMERS_INFERENCE_API to the transformer container and you use the sentence-transformers-all-MiniLM-L6-v2-onnx image for the transformer container. You use the ONNX image when you do not make use of a GPU. YAML version: '3.4' services: weaviate: command: - --host - 0.0.0.0 - --port - '8080' - --scheme - http image: semitechnologies/weaviate:1.23.2 ports: - 8080:8080 - 50051:50051 volumes: - weaviate_data:/var/lib/weaviate restart: on-failure:0 environment: QUERY_DEFAULTS_LIMIT: 25 AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true' PERSISTENCE_DATA_PATH: '/var/lib/weaviate' DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers' ENABLE_MODULES: 'text2vec-transformers' TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080 CLUSTER_HOSTNAME: 'node1' t2v-transformers: image: semitechnologies/transformers-inference:sentence-transformers-all-MiniLM-L6-v2-onnx volumes: weaviate_data: Start the containers: Shell $ docker compose -f docker/compose-embed.yaml up Embedding the data is an important step that needs to be executed thoroughly. It is therefore important to know the Weaviate concepts. Every data object belongs to a Class, and a class has one or more Properties. A Class can be seen as a collection and every data object (represented as JSON-documents) can be represented by a vector (i.e. an embedding). 
Every Class contains objects belonging to that class, which conform to a common schema. Three markdown files with Bruce Springsteen data are available. The embedding will be done as follows: Every markdown file will be converted to a Weaviate Class. A markdown file consists of a header. The header contains the column names, which will be converted into Weaviate Properties. Properties need to be valid GraphQL names. Therefore, the column names have been altered a bit compared to the previous blog. E.g. writer(s) has become writers, album details has become AlbumDetails, etc. After the header, the data is present. Every row in the table will be converted to a data object belonging to a Class. An example of a markdown file is the Compilation Albums file. Markdown | Title | US | AUS | CAN | GER | IRE | NLD | NZ | NOR | SWE | UK | |----------------------------------|----|-----|-----|-----|-----|-----|----|-----|-----|----| | Greatest Hits | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | | Tracks | 27 | 97 | — | 63 | — | 36 | — | 4 | 11 | 50 | | 18 Tracks | 64 | 98 | 58 | 8 | 20 | 69 | — | 2 | 1 | 23 | | The Essential Bruce Springsteen | 14 | 41 | — | — | 5 | 22 | — | 4 | 2 | 15 | | Greatest Hits | 43 | 17 | 21 | 25 | 2 | 4 | 3 | 3 | 1 | 3 | | The Promise | 16 | 22 | 27 | 1 | 4 | 4 | 30 | 1 | 1 | 7 | | Collection: 1973–2012 | — | 6 | — | 23 | 2 | 78 | 19 | 1 | 6 | — | | Chapter and Verse | 5 | 2 | 21 | 4 | 2 | 5 | 4 | 3 | 2 | 2 | In the next sections, the steps taken to embed the documents are explained in more detail. The complete source code is available at GitHub. This is not the cleanest code, but I do hope it is understandable. 3.1 Basic Setup A map is created, which contains the file names linked to the Weaviate Class names to be used. Java private static Map<String, String> documentNames = Map.of( "bruce_springsteen_list_of_songs_recorded.md", "Songs", "bruce_springsteen_discography_compilation_albums.md", "CompilationAlbums", "bruce_springsteen_discography_studio_albums.md", "StudioAlbums"); In the basic setup, a connection is set up to Weaviate, all data is removed from the database, and the files are read. Every file is then processed one by one. Java Config config = new Config("http", "localhost:8080"); WeaviateClient client = new WeaviateClient(config); // Remove existing data Result<Boolean> deleteResult = client.schema().allDeleter().run(); if (deleteResult.hasErrors()) { System.out.println(new GsonBuilder().setPrettyPrinting().create().toJson(deleteResult.getResult())); } List<Document> documents = loadDocuments(toPath("markdown-files")); for (Document document : documents) { ... } 3.2 Convert Header to Class The header information needs to be converted to a Weaviate Class. Split the complete file row by row. The first line contains the header: split it by means of the | separator and store it in variable tempSplittedHeader. The header starts with a | and therefore the first entry in tempSplittedHeader is empty. Remove it and store the remaining part of the row in variable splittedHeader. For every item in splittedHeader (i.e. the column names), a Weaviate Property is created. Strip all leading and trailing spaces from the data. Create the Weaviate documentClass with the class name as defined in the documentNames map and the just-created Properties. Add the class to the schema and verify the result. 
Java // Split the document line by line String[] splittedDocument = document.text().split("\n"); // split the header on | and remove the first item (the line starts with | and the first item is therefore empty) String[] tempSplittedHeader = splittedDocument[0].split("\\|"); String[] splittedHeader = Arrays.copyOfRange(tempSplittedHeader,1, tempSplittedHeader.length); // Create the Weaviate collection, every item in the header is a Property ArrayList<Property> properties = new ArrayList<>(); for (String splittedHeaderItem : splittedHeader) { Property property = Property.builder().name(splittedHeaderItem.strip()).build(); properties.add(property); } WeaviateClass documentClass = WeaviateClass.builder() .className(documentNames.get(document.metadata("file_name"))) .properties(properties) .build(); // Add the class to the schema Result<Boolean> collectionResult = client.schema().classCreator() .withClass(documentClass) .run(); if (collectionResult.hasErrors()) { System.out.println("Creation of collection failed: " + documentNames.get(document.metadata("file_name"))); } 3.3 Convert Data Rows to Objects Every data row needs to be converted to a Weaviate data object. Copy the rows containing data in variable dataOnly. Loop over every row, a row is represented by variable documentLine. Split every line by means of the | separator and store it in variable tempSplittedDocumentLine. Just like the header, every row starts with a |, and therefore, the first entry in tempSplittedDocumentLine is empty. Remove it and store the remaining part of the row in variable splittedDocumentLine. Every item in the row becomes a property. The complete row is converted to properties in variable propertiesDocumentLine. Strip all leading and trailing spaces from the data. Add the data object to the Class and verify the result. At the end, print the result. Java // Preserve only the rows containing data, the first two rows contain the header String[] dataOnly = Arrays.copyOfRange(splittedDocument, 2, splittedDocument.length); for (String documentLine : dataOnly) { // split a data row on | and remove the first item (the line starts with | and the first item is therefore empty) String[] tempSplittedDocumentLine = documentLine.split("\\|"); String[] splittedDocumentLine = Arrays.copyOfRange(tempSplittedDocumentLine, 1, tempSplittedDocumentLine.length); // Every item becomes a property HashMap<String, Object> propertiesDocumentLine = new HashMap<>(); int i = 0; for (Property property : properties) { propertiesDocumentLine.put(property.getName(), splittedDocumentLine[i].strip()); i++; } Result<WeaviateObject> objectResult = client.data().creator() .withClassName(documentNames.get(document.metadata("file_name"))) .withProperties(propertiesDocumentLine) .run(); if (objectResult.hasErrors()) { System.out.println("Creation of object failed: " + propertiesDocumentLine); } String json = new GsonBuilder().setPrettyPrinting().create().toJson(objectResult.getResult()); System.out.println(json); } 3.4 The Result Running the code to embed the documents prints what is stored in the Weaviate vector database. As you can see below, a data object has a UUID, the class is StudioAlbums, the properties are listed and the corresponding vector is displayed. 
Shell { "id": "e0d5e1a3-61ad-401d-a264-f95a9a901d82", "class": "StudioAlbums", "creationTimeUnix": 1705842658470, "lastUpdateTimeUnix": 1705842658470, "properties": { "aUS": "3", "cAN": "8", "gER": "1", "iRE": "2", "nLD": "1", "nOR": "1", "nZ": "4", "sWE": "1", "title": "Only the Strong Survive", "uK": "2", "uS": "8" }, "vector": [ -0.033715352, -0.07489116, -0.015459526, -0.025204511, ... 0.03576842, -0.010400549, -0.075309984, -0.046005197, 0.09666792, 0.0051724687, -0.015554721, 0.041699238, -0.09749843, 0.052182134, -0.0023900834 ] } 4. Manage Collections So, now you have data in the vector database. What kind of information can be retrieved from the database? You are able to manage the collection, for example. 4.1 Retrieve Collection Definition The definition of a collection can be retrieved as follows: Java String className = "CompilationAlbums"; Result<WeaviateClass> result = client.schema().classGetter() .withClassName(className) .run(); String json = new GsonBuilder().setPrettyPrinting().create().toJson(result.getResult()); System.out.println(json); The output is the following: Shell { "class": "CompilationAlbums", "description": "This property was generated by Weaviate\u0027s auto-schema feature on Sun Jan 21 13:10:58 2024", "invertedIndexConfig": { "bm25": { "k1": 1.2, "b": 0.75 }, "stopwords": { "preset": "en" }, "cleanupIntervalSeconds": 60 }, "moduleConfig": { "text2vec-transformers": { "poolingStrategy": "masked_mean", "vectorizeClassName": true } }, "properties": [ { "name": "uS", "dataType": [ "text" ], "description": "This property was generated by Weaviate\u0027s auto-schema feature on Sun Jan 21 13:10:58 2024", "tokenization": "word", "indexFilterable": true, "indexSearchable": true, "moduleConfig": { "text2vec-transformers": { "skip": false, "vectorizePropertyName": false } } }, ... } You can see how it was vectorized, its properties, etc. 4.2 Retrieve Collection Objects Can you also retrieve the collection objects? Yes, you can, but at the moment of writing this is not supported by the Java client library. You will notice, when browsing the Weaviate documentation, that there is no example code for the Java client library. However, you can make use of the GraphQL API, which can also be called from Java code. The code to retrieve the title property of every data object in the CompilationAlbums Class is the following: You call the graphQL method from the Weaviate client. You define the Weaviate Class and the fields you want to retrieve. You print the result. Java Field song = Field.builder().name("title").build(); Result<GraphQLResponse> result = client.graphQL().get() .withClassName("CompilationAlbums") .withFields(song) .run(); if (result.hasErrors()) { System.out.println(result.getError()); return; } System.out.println(result.getResult()); The result shows you all the titles: Shell GraphQLResponse( data={ Get={ CompilationAlbums=[ {title=Chapter and Verse}, {title=The Promise}, {title=Greatest Hits}, {title=Tracks}, {title=18 Tracks}, {title=The Essential Bruce Springsteen}, {title=Collection: 1973–2012}, {title=Greatest Hits} ] } }, errors=null) 5. Semantic Search The whole purpose of embedding the documents is to be able to search them. In order to search, you also need to make use of the GraphQL API. Different search operators are available. Just like in the previous blog, 5 questions are asked about the data. on which album was “adam raised a cain” originally released? The answer is “Darkness on the Edge of Town”. 
what is the highest chart position of “Greetings from Asbury Park, N.J.” in the US? The answer is #60. what is the highest chart position of the album “tracks” in canada? The album did not have a chart position in Canada. in which year was “Highway Patrolman” released? The answer is 1982. who produced “all or nothin’ at all”? The answer is Jon Landau, Chuck Plotkin, Bruce Springsteen and Roy Bittan. In the source code, you provide the class name and the corresponding fields. This information is added in a static class for each collection. The code contains the following: Create a connection to Weaviate. Add the fields of the class and also add two additional fields, the certainty and the distance. Embed the question using a NearTextArgument. Search the collection via the GraphQL API, limiting the result to 1. Print the result. Java private static void askQuestion(String className, Field[] fields, String question) { Config config = new Config("http", "localhost:8080"); WeaviateClient client = new WeaviateClient(config); Field additional = Field.builder() .name("_additional") .fields(Field.builder().name("certainty").build(), // only supported if distance==cosine Field.builder().name("distance").build() // always supported ).build(); Field[] allFields = Arrays.copyOf(fields, fields.length + 1); allFields[fields.length] = additional; // Embed the question NearTextArgument nearText = NearTextArgument.builder() .concepts(new String[]{question}) .build(); Result<GraphQLResponse> result = client.graphQL().get() .withClassName(className) .withFields(allFields) .withNearText(nearText) .withLimit(1) .run(); if (result.hasErrors()) { System.out.println(result.getError()); return; } System.out.println(result.getResult()); } Invoke this method for the five questions. Java askQuestion(Song.NAME, Song.getFields(), "on which album was \"adam raised a cain\" originally released?"); askQuestion(StudioAlbum.NAME, StudioAlbum.getFields(), "what is the highest chart position of \"Greetings from Asbury Park, N.J.\" in the US?"); askQuestion(CompilationAlbum.NAME, CompilationAlbum.getFields(), "what is the highest chart position of the album \"tracks\" in canada?"); askQuestion(Song.NAME, Song.getFields(), "in which year was \"Highway Patrolman\" released?"); askQuestion(Song.NAME, Song.getFields(), "who produced \"all or nothin' at all?\""); The result is amazing: for all five questions, the correct data object is returned. 
Shell GraphQLResponse( data={ Get={ Songs=[ {_additional={certainty=0.7534831166267395, distance=0.49303377}, originalRelease=Darkness on the Edge of Town, producers=Jon Landau Bruce Springsteen Steven Van Zandt (assistant), song="Adam Raised a Cain", writers=Bruce Springsteen, year=1978} ] } }, errors=null) GraphQLResponse( data={ Get={ StudioAlbums=[ {_additional={certainty=0.803815484046936, distance=0.39236903}, aUS=71, cAN=—, gER=—, iRE=—, nLD=—, nOR=—, nZ=—, sWE=35, title=Greetings from Asbury Park,N.J., uK=41, uS=60} ] } }, errors=null) GraphQLResponse( data={ Get={ CompilationAlbums=[ {_additional={certainty=0.7434340119361877, distance=0.513132}, aUS=97, cAN=—, gER=63, iRE=—, nLD=36, nOR=4, nZ=—, sWE=11, title=Tracks, uK=50, uS=27} ] } }, errors=null) GraphQLResponse( data={ Get={ Songs=[ {_additional={certainty=0.743279218673706, distance=0.51344156}, originalRelease=Nebraska, producers=Bruce Springsteen, song="Highway Patrolman", writers=Bruce Springsteen, year=1982} ] } }, errors=null) GraphQLResponse( data={ Get={ Songs=[ {_additional={certainty=0.7136414051055908, distance=0.5727172}, originalRelease=Human Touch, producers=Jon Landau Chuck Plotkin Bruce Springsteen Roy Bittan, song="All or Nothin' at All", writers=Bruce Springsteen, year=1992} ] } }, errors=null) 6. Explore Collections The semantic search implementation assumed that you knew in which collection to search the answer. Most of the time, you do not know which collection to search for. The explore function can help in order to search across multiple collections. There are some limitations to the use of the explore function: Only one vectorizer module may be enabled. The vector search must be nearText or nearVector. The askQuestion method becomes the following. Just like in the previous paragraph, you want to return some additional, more generic fields of the collection. The question is embedded in a NearTextArgument and the collections are explored. Java private static void askQuestion(String question) { Config config = new Config("http", "localhost:8080"); WeaviateClient client = new WeaviateClient(config); ExploreFields[] fields = new ExploreFields[]{ ExploreFields.CERTAINTY, // only supported if distance==cosine ExploreFields.DISTANCE, // always supported ExploreFields.BEACON, ExploreFields.CLASS_NAME }; NearTextArgument nearText = NearTextArgument.builder().concepts(new String[]{question}).build(); Result<GraphQLResponse> result = client.graphQL().explore() .withFields(fields) .withNearText(nearText) .run(); if (result.hasErrors()) { System.out.println(result.getError()); return; } System.out.println(result.getResult()); } Running this code returns an error. A bug is reported, because a vague error is returned. 
Shell GraphQLResponse(data={Explore=null}, errors=[GraphQLError(message=runtime error: invalid memory address or nil pointer dereference, path=[Explore], locations=[GraphQLErrorLocationsItems(column=2, line=1)])]) GraphQLResponse(data={Explore=null}, errors=[GraphQLError(message=runtime error: invalid memory address or nil pointer dereference, path=[Explore], locations=[GraphQLErrorLocationsItems(column=2, line=1)])]) GraphQLResponse(data={Explore=null}, errors=[GraphQLError(message=runtime error: invalid memory address or nil pointer dereference, path=[Explore], locations=[GraphQLErrorLocationsItems(column=2, line=1)])]) GraphQLResponse(data={Explore=null}, errors=[GraphQLError(message=runtime error: invalid memory address or nil pointer dereference, path=[Explore], locations=[GraphQLErrorLocationsItems(column=2, line=1)])]) GraphQLResponse(data={Explore=null}, errors=[GraphQLError(message=runtime error: invalid memory address or nil pointer dereference, path=[Explore], locations=[GraphQLErrorLocationsItems(column=2, line=1)])]) However, in order to work around this error, it is interesting to verify whether the correct answer returns the highest certainty across all collections. Therefore, for each question, every collection is queried. The complete code can be found here; below, only the code for question 1 is shown. The askQuestion implementation is the one used in the Semantic Search paragraph. Java private static void question1() { askQuestion(Song.NAME, Song.getFields(), "on which album was \"adam raised a cain\" originally released?"); askQuestion(StudioAlbum.NAME, StudioAlbum.getFields(), "on which album was \"adam raised a cain\" originally released?"); askQuestion(CompilationAlbum.NAME, CompilationAlbum.getFields(), "on which album was \"adam raised a cain\" originally released?"); } Running this code returns the following output. Shell GraphQLResponse(data={Get={Songs=[{_additional={certainty=0.7534831166267395, distance=0.49303377}, originalRelease=Darkness on the Edge of Town, producers=Jon Landau Bruce Springsteen Steven Van Zandt (assistant), song="Adam Raised a Cain", writers=Bruce Springsteen, year=1978}]}, errors=null) GraphQLResponse(data={Get={StudioAlbums=[{_additional={certainty=0.657206118106842, distance=0.68558776}, aUS=9, cAN=7, gER=—, iRE=73, nLD=4, nOR=12, nZ=11, sWE=9, title=Darkness on the Edge of Town, uK=14, uS=5}]}, errors=null) GraphQLResponse(data={Get={CompilationAlbums=[{_additional={certainty=0.6488107144832611, distance=0.7023786}, aUS=97, cAN=—, gER=63, iRE=—, nLD=36, nOR=4, nZ=—, sWE=11, title=Tracks, uK=50, uS=27}]}, errors=null) The interesting parts here are the certainties: Collection Songs has a certainty of 0.75 Collection StudioAlbums has a certainty of 0.66 Collection CompilationAlbums has a certainty of 0.65 The correct answer can be found in the Songs collection, which has the highest certainty. So, this is great. When you verify this for the other questions, you will see that the collection containing the correct answer always has the highest certainty. Conclusion In this post, you transformed the source documents so that they fit in a vector database. The semantic search results are amazing. In the previous posts, it was quite a struggle to retrieve the correct answers to the questions. By restructuring the data and using only a vector semantic search, a 100% score of correct answers has been achieved.
Trained neural networks arrive at solutions that achieve superhuman performance on an increasing number of tasks. It would be at least interesting and probably important to understand these solutions. Interesting, in the spirit of curiosity and getting answers to questions like, “Are there human-understandable algorithms that capture how object-detection nets work?”[a] This would add a new modality of use to our relationship with neural nets from just querying for answers (Oracle Models) or sending on tasks (Agent Models) to acquiring an enriched understanding of our world by studying the interpretable internals of these networks’ solutions (Microscope Models). [1] And important in its use in the pursuit of the kinds of standards that we (should?) demand of increasingly powerful systems, such as operational transparency, and guarantees on behavioral bounds. A common example of an idealized capability we could hope for is “lie detection” by monitoring the model’s internal state. [2] Mechanistic interpretability (mech interp) is a subfield of interpretability research that seeks a granular understanding of these networks. One could describe two categories of mech interp inquiry: Representation interpretability: Understanding what a model sees and how it does; i.e., what information have models found important to look for in their inputs and how is this information represented internally? Algorithmic interpretability: Understanding how this information is used for computation across the model to result in some observed outcome Figure 1: “A Conscious Blackbox," the cover graphic for James C. Scott’s Seeing Like a State (1998) This post is concerned with representation interpretability. Structured as an exposition of neural network representation research [b], it discusses various qualities of model representations which range in epistemic confidence from the obvious to the speculative and the merely desired. Notes: I’ll use "Models/Neural Nets" and "Model Components" interchangeably. A model component can be thought of as a layer or some other conceptually meaningful ensemble of layers in a network. Until properly introduced with a technical definition, I use expressions like “input-properties” and “input-qualities” in place of the more colloquially used “feature.” Now, to some foundational hypotheses about neural network representations. Decomposability The representations of inputs to a model are a composition of encodings of discrete information. That is, when a model looks for different qualities in an input, the representation of the input in some component of the model can be described as a combination of its representations of these qualities. This makes (de)composability a corollary of “encoding discrete information”- the model’s ability to represent a fixed set of different qualities as seen in its inputs. Figure 2: A model layer trained on a task that needs it to care about background colors (trained on only blue and red) and center shapes (only circles and triangles) The component has dedicated a different neuron to the input qualities: "background color is composed of red," "background color is composed of blue," "center object is a circle,” and “center object is a triangle.” Consider the alternative: if a model didn't identify any predictive discrete qualities of inputs in the course of training. 
To do well on a task, the network would have to work like a lookup table with its keys as the bare input pixels (since it can’t glean any discrete properties more interesting than “the ordered set of input pixels”) pointing to unique identifiers. We have a name for this in practice: memorizing. Therefore, saying, “Model components learn to identify useful discrete qualities of inputs and compose them to get internal representations used for downstream computation,” is not far off from saying “Sometimes, neural nets don’t completely memorize.” Figure 3: An example of how learning discrete input qualities affords generalization or robustness This example test input, not seen in training, has a representation expressed in the learned qualities. While the model might not fully appreciate what “purple” is, it’ll be better off than if it was just trying to do a table lookup for input pixels. Revisiting the hypothesis: "The representations of inputs to a model are a composition of encodings of discrete information." While, as we’ve seen, this verges on the obvious; it provides a template for introducing stricter specifications deserving of study. The first of these specification revisits looks at “…are a composition of encodings…” What is observed, speculated, and hoped for about the nature of these compositions of the encodings? Linearity To recap decomposition, we expect (non-memorizing) neural networks to identify and encode varied information from input qualities/properties. This implies that any activation state is a composition of these encodings. Figure 4: What the decomposability hypothesis suggests What is the nature of this composition? In this context, saying a representation is linear suggests the information of discrete input qualities are encoded as directions in activation space and they are composed into a representation by a vector sum: We’ll investigate both claims. Claim #1: Encoded Qualities Are Directions in Activation Space Composability already suggests that the representation of input in some model components (a vector in activation space) is composed of discrete encodings of input qualities (other vectors in activation space). The additional thing said here is that in a given input-quality encoding, we can think of there being some core essence of the quality which is the vector’s direction. This makes any particular encoding vector just a scaled version of this direction (unit vector.) Figure 5: Various encoding vectors for the red-ness quality in the input They are all just scaled representations of some fundamental red-ness unit vector, which specifies direction. This is simply a generalization of the composability argument that says neural networks can learn to make their encodings of input qualities "intensity"-sensitive by scaling some characteristic unit vector. Alternative Impractical Encoding Regimes Figure 6a An alternative encoding scheme could be that all we can get from models are binary encodings of properties; e.g., “The Red values in this RGB input are Non-zero.” This is clearly not very robust. Figure 6b Another is that we have multiple unique directions for qualities that could be described by mere differences in scale of some more fundamental quality: “One Neuron for "kind-of-red" for 0-127 in the RGB input, another for "really-red" for 128-255 in the RGB input.” We’d run out of directions fairly quickly. 
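Before moving on to the second claim, here is a minimal numeric sketch of Claim #1 in Python. It assumes a toy 2-dimensional activation space and a made-up unit direction for the "red-ness" quality (both are illustrative choices, not anything read out of a real model): every encoding of the quality is just that unit direction scaled by an intensity, and the intensity can be read back with a dot product.

Python

import numpy as np

# A hypothetical unit direction for the "red-ness" quality in a toy 2-D activation space.
redness_direction = np.array([0.6, 0.8])             # unit length: direction only
assert np.isclose(np.linalg.norm(redness_direction), 1.0)

# "Kind-of-red" and "really-red" are the same direction at different scales,
# not two separate directions (contrast with Figure 6b).
kind_of_red = 0.3 * redness_direction
really_red = 0.9 * redness_direction

# The intensity of the quality is recovered by projecting onto the unit direction.
for encoding in (kind_of_red, really_red):
    intensity = encoding @ redness_direction          # vector dot product
    print(round(float(intensity), 2))                 # prints 0.3, then 0.9

Unlike the binary scheme of Figure 6a, the intensity survives, and unlike the scheme of Figure 6b, no extra directions are spent on "kind-of-red" versus "really-red."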
Claim #2: These Encodings Are Composed as a Vector Sum Now, this is the stronger of the two claims, as it is not necessarily a consequence of anything introduced thus far. Figure 7: An example of a 2-property representation Note: We assume independence between properties, ignoring the degenerate case where a size of zero implies the color is not red (nothing). A vector sum might seem like the natural (if not only) thing a network could do to combine these encoding vectors. To appreciate why this claim is worth verifying, it'll be worth investigating if alternative non-linear functions could also get the job done. Recall that the thing we want is a function that combines these encodings at some component in the model in a way that preserves information for downstream computation. So this is effectively an information compression problem. As discussed in Elhage et al [3a], a non-linear compression scheme along the following lines could get the job done:

t = (floor(Z * x) + y) / Z

where we seek to compress values x and y into t. The value of Z is chosen according to the required floating-point precision needed for the compression. Python

# A Python Implementation
from math import floor

def compress_values(x1, x2, precision=1):
    z = 10 ** precision
    compressed_val = (floor(z * x1) + x2) / z
    return round(compressed_val, precision * 2)

def recover_values(compressed_val, precision=1):
    z = 10 ** precision
    x2_recovered = (compressed_val * z) - floor(compressed_val * z)
    x1_recovered = compressed_val - (x2_recovered / z)
    return round(x1_recovered, precision), round(x2_recovered, precision)

# Now to compress vectors a and b
a = [0.3, 0.6]
b = [0.2, 0.8]

compressed_a_b = [compress_values(a[0], b[0]), compress_values(a[1], b[1])]
# Returned [0.32, 0.68]

recovered_a, recovered_b = (
    [x, y]
    for x, y in zip(
        recover_values(compressed_a_b[0]),
        recover_values(compressed_a_b[1])
    )
)
# Returned ([0.3, 0.6], [0.2, 0.8])

assert all([recovered_a == a, recovered_b == b])

As demonstrated, we're able to compress and recover vectors a and b just fine, so this is also a viable way of compressing information for later computation, using non-linearities like the floor() function that neural networks can approximate. While this seems a little more tedious than just adding vectors, it shows the network does have options. This calls for some evidence and further arguments in support of linearity. Evidence of Linearity The often-cited example of a model component exhibiting strong linearity is the embedding layer in language models [4], where relationships like the following exist between representations of words:

E("cats") - E("cat") ≈ E("dogs") - E("dog")

This example would hint at the following relationship between the quality of plurality in the input words and the rest of their representation:

E("cats") ≈ E("cat") + v_plurality

Okay, so that's some evidence for one component in a type of neural network having linear representations. The broad outline of arguments for this being prevalent across networks is that linear representations are both the more natural and performant [3b][3a] option for neural networks to settle on. How Important for Interpretability Is It That This Is True? If non-linear compression is prevalent across networks, there are two alternative regimes in which networks could operate: Computation is still mostly done on linear variables: In this regime, while the information is encoded and moved between components non-linearly, the model components would still decompress the representations to run linear computations. 
From an interpretability standpoint, while this needs some additional work to reverse engineer the decompression operation, this wouldn't pose too high a barrier. Figure 8: Non-linear compression and propagation interleaved with linear computation Computation is done in a non-linear state: The model figures out a way to do computations directly on the non-linear representation. This would pose a challenge needing new interpretability methods. However, based on arguments discussed earlier about model architecture affordances, this is expected to be unlikely. Figure 9: Direct non-linear computation Features As promised in the introduction, after avoiding the word "feature" this far into the post, we'll introduce it properly. As a quick aside, I think the engagement of the research community on the topic of defining what we mean when we use the word "feature" is one of the things that makes mech interp, as a pre-paradigmatic science, exciting. While different definitions have been proposed [3c] and the final verdict is by no means out, in this post and others to come on mech interp, I'll be using the following: "The features of a given neural network constitute a set of all the input qualities the network would dedicate a neuron to if it could." We've already discussed the idea of networks necessarily encoding discrete qualities of inputs, so the most interesting part of the definition is, "...would dedicate a neuron to if it could." What Is Meant by "...Dedicate a Neuron To..."? In a case where all quality-encoding directions are unique one-hot vectors in activation space ([0, 1] and [1, 0], for example), the neurons are said to be basis-aligned; i.e., one neuron's activation in the network independently represents the intensity of one input quality. Figure 10: Example of a representation with basis-aligned neurons Note that while sufficient, this property is not necessary for lossless compression of encodings with vector addition. The core requirement is that these feature directions be orthogonal. The reason for this is the same as when we explored the non-linear compression method: we want to completely recover each encoded feature downstream. Basis Vectors Following the Linearity hypothesis, we expect the activation vector to be a sum of all the scaled feature directions:

activation = Σ_j feature^j_i · feature^j_d

Given an activation vector (which is what we can directly observe when our network fires), if we want to know the activation intensity of some feature in the input, all we need is the feature's unit vector, feature^j_d:

activation · feature^j_d = Σ_k feature^k_i · (feature^k_d · feature^j_d)

(where the character "·" in the expression above is the vector dot product.) If all the feature unit vectors of that network component (making up the set, Features_d) are orthogonal to each other:

feature^k_d · feature^j_d = 0 for every k ≠ j

And, for any unit vector:

feature^j_d · feature^j_d = 1

These simplify our equation to give an expression for our feature intensity feature^j_i:

feature^j_i = activation · feature^j_d

Allowing us to fully recover our compressed feature:

feature^j_i · feature^j_d = (activation · feature^j_d) · feature^j_d

All that was to establish the ideal property of orthogonality between feature directions. This means even though the idea of "one neuron firing by x-much == one feature is present by x-much" is pretty convenient to think about, there are other equally performant feature directions that don't have their neuron-firing patterns aligning this cleanly with feature patterns. (As an aside, it turns out basis-aligned neurons don't happen that often. [3d]) Figure 11: Orthogonal feature directions from non-basis-aligned neurons With this context, the request: "dedicate a neuron to…" might seem arbitrarily specific. 
Perhaps “dedicate an extra orthogonal direction vector” would be sufficient to accommodate an additional quality. But as you probably already know, orthogonal vectors in a space don’t grow on trees. A 2-dimensional space can only have 2 orthogonal vectors at a time, for example. So to make more room, we might need an extra dimension, i.e [X X] -> [X X X] which is tantamount to having an extra neuron dedicated to this feature. How Are These Features Stored in Neural Networks? To touch grass quickly, what does it mean when a model component has learned 3 orthogonal feature directions {[1 0 0], [0 1 0], [0 0 1]} for compressing an input vector [a b c]? To get the compressed activation vector, we expect a series of dot products with each feature direction to get our feature scale. Now we just have to sum up our scaled-up feature directions to get our “compressed” activation state. In this toy example, the features are just the vector values so lossless decompressing gets us what we started with. The question is: what does this look like in a model? The above sequence of transformations of dot products followed by a sum is equivalent to the operations of the deep learning workhorse: matrix multiplication. The earlier sentence, “…a model component has learned 3 orthogonal feature directions,” should have been a giveaway. Models store their learnings in weights, and so our feature vectors are just the rows of this layer’s learned weight matrix, W. Why didn’t I just say the whole time, “Matrix multiplication. End of section.” Because we don’t always have toy problems in the real world. The learned features aren’t always stored in just one set of weights. It could (and usually does) involve an arbitrarily long sequence of linear and non-linear compositions to arrive at some feature direction (but the key insight of decompositional linearity is that this computation can be summarised by a direction used to compose some activation). The promise of linearity we discussed only has to do with how feature representations are composed. For example, some arbitrary vector is more likely to not be hanging around for discovery by just reading one row of a layer’s weight matrix, but the computation to encode that feature is spread across several weights and model components. So we had to address features as arbitrary strange directions in activation space because they often are. This point brings the proposed dichotomy between representation and algorithmic interpretability into question. Back to our working definition of features: "The features of a given neural network constitute a set of all the input qualities the network would dedicate a neuron to if it could." On the Conditional Clause: “…Would Dedicate a Neuron to if It Could...” You can think of this definition of a feature as a bit of a set-up for an introduction to a hypothesis that addresses its counterfactual: What happens when a neural network cannot provide all its input qualities with dedicated neurons? Superposition Thus far, our model has done fine on the task that required it to compress and propagate 2 learned features — “size” and “red-ness” — through a 2-dimensional layer. What happens when a new task requires the compression and propagation of an additional feature like the x-displacement of the center of the square? Figure 12 This shows our network with a new task, requiring it to propagate one more learned property of the input: center x-displacement. We’ve returned to using neuron-aligned bases for convenience. 
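As a small, self-contained illustration of the last two sections (all numbers made up), the sketch below stores two orthogonal, non-basis-aligned feature directions as the rows of a weight matrix W, compresses an input's feature intensities into a 2-dimensional activation with a matrix multiplication, and recovers them exactly with dot products. It then squeezes a third direction into the same 2-dimensional layer, the situation of Figure 12: exact orthogonality is no longer possible, the read-outs pick up interference, and, as in Toy Models of Superposition [3], a sparse input plus a simple nonlinearity is what keeps that interference manageable.

Python

import numpy as np

# Two orthogonal (but not basis-aligned) feature directions stored as rows of W.
W2 = np.array([[0.6, 0.8],     # hypothetical "size" direction
               [-0.8, 0.6]])   # hypothetical "red-ness" direction
intensities = np.array([0.7, 0.4])       # how strongly each feature is present

activation = W2.T @ intensities          # compose: sum of scaled directions
recovered = W2 @ activation              # read out: dot product with each direction
print(np.round(recovered, 3))            # [0.7 0.4] -- lossless recovery

# Now squeeze a third direction (say, "center x-displacement") into the same 2-D layer.
# Three unit vectors in 2-D cannot all be orthogonal; here they sit 120 degrees apart.
angles = np.deg2rad([90.0, 210.0, 330.0])
W3 = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # three directions as rows

sparse_intensities = np.array([0.7, 0.0, 0.0])   # only one feature active in this input
activation3 = W3.T @ sparse_intensities
readout = W3 @ activation3
print(np.round(readout, 3))                      # [ 0.7  -0.35 -0.35]: interference on the inactive features
print(np.round(np.maximum(readout, 0.0), 3))     # [0.7 0. 0.]: a ReLU-style nonlinearity cleans it up

This is, in miniature, the tension the next section names: packing more feature directions than dimensions can work, but only at the price of interference that something downstream has to tolerate or filter out.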
Before we go further with this toy model, it would be worth thinking through whether there are analogs of this in large real-world models. Let's take the large language model GPT2 small [5]. Do you think, if you had all week, you could come up with 769 useful features of an arbitrary 700-word query that would help predict the next token (e.g., "is a formal letter," "contains how many verbs," "is about 'Chinua Achebe,'" etc.)? Even if we ignored the fact that feature discovery was one of the known superpowers of neural networks [c] and assumed GPT2-small would also end up with only 769 useful input features to encode, we'd have a situation much like our toy problem above. This is because GPT2 has, at the narrowest point in its architecture, only 768 neurons to work with, just like our toy problem has 2 neurons but needs to encode information about 3 features. [d] So this whole "model component encodes more features than it has neurons" business should be worth looking into. It probably also needs a shorter name. That name is the Superposition hypothesis. Considering the above thought experiment with GPT2 Small, it would seem this hypothesis is just stating the obvious: that models are somehow able to represent more input qualities (features) than they have dimensions for. What Exactly Is Hypothetical About Superposition? There's a reason I introduced it this late in the post: it depends on other abstractions that aren't necessarily self-evident. The most important is the prior formulation of features. It assumes linear decomposition: the expression of neural net representations as sums of scaled directions representing discrete qualities of their inputs. These definitions might seem circular, but they're not if defined sequentially: If you conceive of neural networks as encoding discrete information of inputs called Features as directions in activation space, then when we suspect the model has more of these features than it has neurons, we call this Superposition. A Way Forward As we've seen, it would be convenient if the features of a model were aligned with neurons, and it is necessary for them to be orthogonal vectors to allow lossless recovery from compressed representations. So to suggest this isn't happening poses difficulties to interpretation and raises questions on how networks can pull this off anyway. Further development of the hypothesis provides a model for thinking about why and how superposition happens, clearly exposes the phenomenon in toy problems, and develops promising methods for working around barriers to interpretability [6]. More on this in a future post. Footnotes [a] That is, algorithms more descriptive than "Take this Neural Net architecture and fill in its weights with these values, then do a forward pass." [b] Primarily from ideas introduced in Toy Models of Superposition [c] This refers specifically to the codification of features as their superpower. Humans are pretty good at predicting the next token in human text; we're just not good at writing programs for extracting and representing this information in vector space. All of that is hidden away in the mechanics of our cognition. [d] Technically, the number to compare the 768-dimension residual stream width to is the maximum number of features we think *any* single layer would have to deal with at a time. 
If we assume equal computational workload between layers and assume each batch of features was built based on computations on the previous, for the 12-layer GPT2 model, this would be 12 * 768 = 9,216 features you’d need to think up. References [1] Chris Olah on Mech Interp - 80000 Hours [2] Interpretability Dreams [3] Toy Models of Superposition [3a] Nonlinear Compression [3b] Features as Directions [3c] What are Features? [3d] Definitions and Motivation [4] Linguistic regularities in continuous space word representations: Mikolov, T., Yih, W. and Zweig, G., 2013. Proceedings of the 2013 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies, pp. 746--751. [5] Language Models are Unsupervised Multitask Learners: Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever [6] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Year by year, artificial intelligence evolves and becomes more efficient for solving everyday human tasks. But at the same time, it increases the possibility of personal information misuse, reaching unprecedented levels of power and speed in analyzing and spreading individuals' data. In this article, I would like to take a closer look at the strong connection between artificial intelligence systems and machine learning and their use of increasingly private and sensitive data. Together, we'll explore existing privacy risks, discuss traditional approaches to privacy in machine learning, and analyze ways to overcome security breaches. Importance of Privacy in AI It is no secret that today, AI is extensively used in many spheres, including marketing. NLP, or Natural Language Processing, interprets human language and is used in voice assistants and chatbots, understanding accents and emotions; it links social media content to engagement. Machine learning employs algorithms to analyze data, improve performance, and enable AI to make decisions without human intervention. Deep Learning relies on neural networks and uses extensive datasets for informed choices. These AI types often collaborate, posing challenges to data privacy. AI collects data intentionally, where users provide information, or unintentionally, for instance, through facial recognition. The problem arises when unintentional data collection leads to unexpected uses, compromising privacy. For example, discussing pet food or more intimate purchases around a phone can lead to targeted ads, revealing unintentional data gathering. AI algorithms, while being intelligent, may inadvertently capture information and subject it to unauthorized use. Thus, video doorbells with facial identification intended for family recognition may unintentionally collect data about unrelated individuals, causing neighbors to worry about surveillance and data access. Bearing in mind the above, it is crucially important to establish a framework for ethical decision-making regarding the use of new AI technologies. Addressing privacy challenges and contemplating the ethics of technology is imperative for the enduring success of AI. One of the main reasons for that is that finding a balance between technological innovation and privacy concerns will foster the development of socially responsible AI, contributing to the long-term creation of public value and private security. Traditional Approach Risks Before we proceed with efficient privacy-preserving techniques, let us take a look at traditional approaches and the problems they may face. Traditional approaches to privacy and machine learning are centered mainly around two concepts: user control and data protection. Users want to know who collects their data, for what purpose, and how long it will be stored. Data protection involves anonymized and encrypted data, but even here, the gaps are inevitable, especially in machine learning, where decryption is often necessary. Another issue is that machine learning involves multiple stakeholders, creating a complex web of trust. Trust is crucial when sharing digital assets, such as training data, inference data, and machine learning models across different entities. Just imagine that there is an entity that owns the training data, while another set of entities may own the inference data. The third entity provides a machine learning server running on the inference, performed by a model owned by someone else. 
Additionally, it operates on infrastructure from an extensive supply chain involving many parties. Due to this, all the entities must demonstrate trust in each other within a complex chain. Managing this web of trust becomes increasingly difficult. Examples of Security Breaches As we rely more on communication technologies using machine learning, the chance of data breaches and unauthorized access goes up. Hackers might try to take advantage of vulnerabilities in these systems to get hold of personal data, such as name, address, and financial information, which can result in financial losses and identity theft. A report on the malicious use of AI outlines three areas of security concern: expansion of existing threats, new attack methods, and changes in the typical character of threats. Examples of malicious AI use include business email compromise (BEC) attacks using deepfake technology, contributing to social engineering tactics. AI-assisted cyber-attacks, demonstrated by IBM's DeepLocker, show how AI can enhance ransomware attacks by making decisions based on trends and patterns. Notably, TaskRabbit experienced an AI-assisted cyber-attack, where an AI-enabled botnet executed a DDoS attack, leading to a data breach which affected 3.75 million customers. Moreover, increased online shopping is fueling card-not-present (CNP) fraud, combined with rising synthetic identity and identity theft issues. Predicted losses from CNP fraud could reach $200 billion by 2024, with transaction volumes rising over 23%. Privacy-Preserving Machine Learning This is where privacy-preserving machine learning comes in with a solution. Among the most effective techniques are federated learning, homomorphic encryption, and differential privacy. Federated learning allows separate entities to collectively train a model without sharing explicit data. In turn, homomorphic encryption enables machine learning on encrypted data throughout the process, and differential privacy ensures that calculation outputs cannot be tied to the presence of any individual's data. These techniques, combined with trusted execution environments, can effectively address the challenges at the intersection of privacy and machine learning. Privacy Advantages of Federated Learning As you can see, classical machine learning approaches are less suited to implementing AI systems and IoT practices securely than privacy-preserving machine learning techniques, particularly federated learning. Being a decentralized version of machine learning, FL helps make AI security-preserving techniques more reliable. In traditional methods, sensitive user data is sent to centralized servers for training, posing numerous privacy concerns; federated learning addresses this by allowing models to be trained locally on devices, ensuring user data security (a minimal federated-averaging sketch is included at the end of this article). Enhanced Data Privacy and Security Federated learning, with its collaborative nature, treats each IoT device on the edge as a unique client, training models without transmitting raw data. This ensures that during the federated learning process, each IoT device only gathers the necessary information for its task. By keeping raw data on the device and sending only model updates to the central server, federated learning safeguards private information, minimizes the risk of personal data leakage, and ensures secure operations. Improved Data Accuracy and Diversity Another important issue is that centralized data used to train a model may not accurately represent the full spectrum of data that the model will encounter. 
In contrast, training models on decentralized data from various sources and exposing them to a broader range of information enhances the model's ability to generalize to new data, handle variations, and reduce bias. Higher Adaptability One more advantage federated learning models exhibit is a notable capability to adapt to new situations without requiring retraining, which provides extra security and reliability. Using insights from previous experiences, these models can make predictions and apply knowledge gained in one field to another. For instance, if the model becomes more proficient in predicting outcomes in a specific domain, it can seamlessly apply this knowledge to another field, enhancing efficiency, reducing costs, and expediting processes. Encryption Techniques To further enhance privacy in FL, encryption techniques are often used. Among them are homomorphic encryption and secure multi-party computation. These methods ensure that data stays encrypted and secure during communication and model aggregation. Homomorphic encryption allows computations on encrypted data without decryption. For example, if a user wants to upload data to a cloud-based server, they can encrypt it, turning it into ciphertext, and only after that upload it. The server then processes that data without decrypting it and returns the result, which the user decrypts with their secret key. Multi-party computation, or MPC, enables multiple parties, each with their own private data, to jointly evaluate a computation without revealing any of the private data held by each party. A multi-party computation protocol ensures both privacy and accuracy: the private information held by the parties cannot be inferred from the execution of the protocol, and if any party within the group decides to share information or deviates from the instructions during the protocol execution, the MPC protocol will not allow it to force the other parties to output an incorrect result or to leak any private information. Final Considerations Instead of a conclusion, I would like to stress the importance and urgency of embracing advanced security approaches in ML. For effective and long-term outcomes in AI safety and security, there should be coordinated efforts between the AI development community and legal and policy institutions. Building trust and establishing proactive channels for collaboration in developing norms, ethics, standards, and laws is crucial to avoid reactive and potentially ineffective responses from both the technical and policy sectors. I would also like to quote the authors of the report mentioned above, who propose the following recommendations to address security challenges in AI: Policymakers should collaborate closely with technical researchers to explore, prevent, and mitigate potential malicious applications of AI. AI researchers and engineers should recognize the dual-use nature of their work, considering the potential for misuse and allowing such considerations to influence research priorities and norms. They should also proactively engage with relevant stakeholders when harmful applications are foreseeable. Identify best practices from mature research areas, like computer security, and apply them to address dual-use concerns in AI. Actively work towards broadening the involvement of stakeholders and domain experts in discussions addressing these challenges. I hope this article encourages you to investigate the topic on your own, contributing to a more secure digital world.
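To make the federated-learning idea referenced earlier a little more concrete, here is a minimal, illustrative federated-averaging sketch in Python on a toy linear-regression model. All client data, names, and numbers are invented for the example, and a real deployment would layer secure aggregation, differential-privacy noise, or homomorphic encryption on top of the update exchange, as discussed above; the key point is that only model weights, never raw data, travel between the clients and the server.

Python

import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local training: a few gradient steps on its own private data only."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
        w -= lr * grad
    return w

# Three hypothetical clients, each holding a private dataset that never leaves the device.
true_w = np.array([2.0, -1.0])
clients = []
for n in (30, 50, 20):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + 0.1 * rng.normal(size=n)
    clients.append((X, y))

# Federated averaging: only model weights travel between the clients and the server.
global_w = np.zeros(2)
for _ in range(10):                      # communication rounds
    local_ws, sizes = [], []
    for X, y in clients:
        local_ws.append(local_update(global_w, X, y))
        sizes.append(len(y))
    # The server aggregates the updates, weighted by how much data each client holds.
    global_w = np.average(local_ws, axis=0, weights=sizes)

print(np.round(global_w, 2))             # expected to land close to [ 2. -1.]

Weighting the average by each client's sample count mirrors the common federated-averaging heuristic of giving data-rich clients proportionally more influence on the global model.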
The NIST AI RMF (National Institute of Standards and Technology Artificial Intelligence Risk Management Framework) provides a structured framework for identifying, assessing, and mitigating risks associated with artificial intelligence technologies, addressing complex challenges such as algorithmic bias, data privacy, and ethical considerations, thus helping organizations ensure the security, reliability, and ethical use of AI systems. How Do AI Risks Differ From Traditional Software Risks? AI risks differ from traditional software risks in several key ways: Complexity: AI systems often involve complex algorithms, machine learning models, and large datasets, which can introduce new and unpredictable risks. Algorithmic bias: AI systems can exhibit bias or discrimination based on factors such as the training data used to develop the models. This can result in unintended outcomes and consequences, which may not be part of traditional software systems. Opacity and lack of interpretability: AI algorithms, particularly deep learning models, can be opaque and difficult to interpret. This can make it challenging to understand how AI systems make decisions or predictions, leading to risks related to accountability, transparency, and trust. Data quality and bias: AI systems rely heavily on data, and issues such as data quality, incompleteness, and bias can significantly impact their performance and reliability. Traditional software may also rely on data, but the implications of data quality issues may be more noticeable in AI systems, affecting the accuracy, and effectiveness of AI-driven decisions. Adversarial attacks: AI systems may be vulnerable to adversarial attacks, where malicious actors manipulate inputs to deceive or manipulate the system's behavior. Adversarial attacks exploit vulnerabilities in AI algorithms and can lead to security breaches, posing distinct risks compared to traditional software security threats. Ethical and societal implications: AI technologies raise ethical and societal concerns that may not be as prevalent in traditional software systems. These concerns include issues such as privacy violations, job displacement, loss of autonomy, and reinforcement of biases. Regulatory and compliance challenges: AI technologies are subject to a rapidly evolving regulatory landscape, with new laws and regulations emerging to address AI-specific risks and challenges. Traditional software may be subject to similar regulations, but AI technologies often raise novel compliance issues related to fairness, accountability, transparency, and bias mitigation. Cost: The expense associated with managing an AI system exceeds that of regular software, as it often requires ongoing tuning to align with the latest models, training, and self-updating processes. Effectively managing AI risks requires specialized knowledge, tools, and frameworks tailored to the unique characteristics of AI technologies and their potential impact on individuals, organizations, and society as a whole. Key Considerations of the AI RMF The AI RMF refers to an AI system as an engineered or machine-based system that can, for a given set of objectives, generate outputs such as predictions, recommendations, or decisions influencing real or virtual environments. The AI RMF helps organizations effectively identify, assess, mitigate, and monitor risks associated with AI technologies throughout the lifecycle. 
It addresses various challenges, like data quality issues, model bias, adversarial attacks, algorithmic transparency, and ethical considerations. Key considerations include: Risk identification Risk assessment and prioritization Control selection and tailoring Implementation and integration Monitoring and evaluation Ethical and social implications Interdisciplinary collaboration Key Functions of the Framework Following are the essential functions within the NIST AI RMF that help organizations effectively identify, assess, mitigate, and monitor risks associated with AI technologies. Image courtesy of NIST AI RMF Playbook Govern Governance in the NIST AI RMF refers to the establishment of policies, processes, structures, and mechanisms to ensure effective oversight, accountability, and decision-making related to AI risk management. This includes defining roles and responsibilities, setting risk tolerance levels, establishing policies and procedures, and ensuring compliance with regulatory requirements and organizational objectives. Governance ensures that AI risk management activities are aligned with organizational priorities, stakeholder expectations, and ethical standards. Map Mapping in the NIST AI RMF involves identifying and categorizing AI-related risks, threats, vulnerabilities, and controls within the context of the organization's AI ecosystem. This includes mapping AI system components, interfaces, data flows, dependencies, and associated risks to understand the broader risk landscape. Mapping helps organizations visualize and prioritize AI-related risks, enabling them to develop targeted risk management strategies and allocate resources effectively. It may also involve mapping AI risks to established frameworks, standards, or regulations to ensure comprehensive coverage and compliance. Measurement Measurement in the NIST AI RMF involves assessing and quantifying AI-related risks, controls, and performance metrics to evaluate the effectiveness of risk management efforts. This includes conducting risk assessments, control evaluations, and performance monitoring activities to measure the impact of AI risks on organizational objectives and stakeholder interests. Measurement helps organizations identify areas for improvement, track progress over time, and demonstrate the effectiveness of AI risk management practices to stakeholders. It may also involve benchmarking against industry standards or best practices to identify areas for improvement and drive continuous improvement. Manage Management in the NIST AI RMF refers to the implementation of risk management strategies, controls, and mitigation measures to address identified AI-related risks effectively. This includes implementing selected controls, developing risk treatment plans, and monitoring AI systems' security posture and performance. Management activities involve coordinating cross-functional teams, communicating with stakeholders, and adapting risk management practices based on changing risk environments. Effective risk management helps organizations minimize the impact of AI risks on organizational objectives, stakeholders, and operations while maximizing the benefits of AI technologies. Key Components of the Framework The NIST AI RMF consists of two primary components: Foundational Information This part includes introductory materials, background information, and context-setting elements that provide an overview of the framework's purpose, scope, and objectives. 
It may include definitions, principles, and guiding principles relevant to managing risks associated with artificial intelligence (AI) technologies. Core and Profiles This part comprises the core set of processes, activities, and tasks necessary for managing AI-related risks, along with customizable profiles that organizations can tailor to their specific needs and requirements. The core provides a foundation for risk management, while profiles allow organizations to adapt the framework to their unique circumstances, addressing industry-specific challenges, regulatory requirements, and organizational priorities. Significance of AI RMF Based on Roles Benefits for Developers Guidance on risk management: The AI RMF provides developers with structured guidance on identifying, assessing, mitigating, and monitoring risks associated with AI technologies. Compliance with standards and regulations: The AI RMF helps developers ensure compliance with relevant standards, regulations, and best practices governing AI technologies. By referencing established NIST guidelines, such as NIST SP 800-53, developers can identify applicable security and privacy controls for AI systems. Enhanced security and privacy: By incorporating security and privacy controls recommended in the AI RMF, developers can mitigate the risks of data breaches, unauthorized access, and other security threats associated with AI systems. Risk awareness and mitigation: The AI RMF raises developers' awareness of potential risks and vulnerabilities inherent in AI technologies, such as data quality issues, model bias, adversarial attacks, and algorithmic transparency. Cross-disciplinary collaboration: The AI RMF emphasizes the importance of interdisciplinary collaboration between developers, cybersecurity experts, data scientists, ethicists, legal professionals, and other stakeholders in managing AI-related risks. Quality assurance and testing: The AI RMF encourages developers to incorporate risk management principles into the testing and validation processes for AI systems. Benefits for Architects Designing secure and resilient systems: Architects play a crucial role in designing the architecture of AI systems. By incorporating principles and guidelines from the AI RMF into the system architecture, architects can design AI systems that are secure, resilient, and able to effectively manage risks associated with AI technologies. This includes designing robust data pipelines, implementing secure APIs, and integrating appropriate security controls to mitigate potential vulnerabilities. Ensuring compliance and governance: Architects are responsible for ensuring that AI systems comply with relevant regulations, standards, and organizational policies. By integrating compliance requirements into the system architecture, architects can ensure that AI systems adhere to legal and ethical standards while protecting sensitive information and user privacy. Addressing ethical and societal implications: Architects need to consider the ethical and societal implications of AI technologies when designing system architectures. Architects can leverage the AI RMF to incorporate mechanisms for ethical decision-making, algorithmic transparency, and user consent into the system architecture, ensuring that AI systems are developed and deployed responsibly. Supporting continuous improvement: The AI RMF promotes a culture of continuous improvement in AI risk management practices. 
Architects can leverage the AI RMF to establish mechanisms for monitoring and evaluating the security posture and performance of AI systems over time.
Comparison of AI Risk Frameworks
NIST AI RMF
Strengths: Comprehensive coverage of AI-specific risks; integration with established NIST cybersecurity guidelines; interdisciplinary approach; alignment with regulatory requirements; emphasis on continuous improvement.
Weaknesses: May require customization to address specific organizational needs; focus on a US-centric regulatory landscape.
ISO/IEC 27090
Strengths: Widely recognized international standard; designed to integrate seamlessly with ISO/IEC 27001, the international standard for information security management systems (ISMS); provides comprehensive guidance on managing risks associated with AI technologies; follows a structured approach incorporating the Plan-Do-Check-Act (PDCA) cycle.
Weaknesses: Lacks specificity in certain areas, as it aims to provide general guidance applicable to a wide range of organizations and industries; implementation can be complex, particularly for organizations that are new to information security management or AI risk management; the standard's comprehensive nature and technical language may require significant expertise and resources to understand and implement effectively.
IEEE P7006
Strengths: Focus on data protection considerations in AI systems, particularly those related to personal data; comprehensive guidelines for ensuring privacy, fairness, transparency, and accountability.
Weaknesses: Scope is limited to personal data protection; may not cover all aspects of AI risk management.
Fairness, Accountability, and Transparency (FAT) Framework
Strengths: Emphasis on ethical dimensions of AI, including fairness, accountability, transparency, and explainability; provides guidelines for evaluating and mitigating ethical risks.
Weaknesses: Not a comprehensive risk management framework; may lack detailed guidance on technical security controls.
IBM AI Governance Framework
Strengths: Focus on governance aspects of AI projects; covers various stages of the AI lifecycle, including data management, model development, deployment, and monitoring; emphasis on transparency, fairness, and trustworthiness.
Weaknesses: Developed by a specific vendor and may be perceived as biased; may not fully address regulatory requirements beyond IBM's scope.
Google AI Principles
Strengths: Clear principles for ethical AI development and deployment; emphasis on fairness, privacy, accountability, and societal impact; provides guidance for responsible AI practices.
Weaknesses: Not a comprehensive risk management framework; lacks detailed implementation guidance.
AI Ethics Guidelines from Industry Consortia
Strengths: Developed by diverse stakeholders, including industry, academia, and civil society; provides a broad perspective on ethical AI considerations; emphasis on collaboration and knowledge sharing.
Weaknesses: Not comprehensive risk management frameworks; may lack detailed implementation guidance.
Conclusion
The NIST AI Risk Management Framework offers a comprehensive approach to addressing the complex challenges associated with managing risks in artificial intelligence (AI) technologies. Through its foundational information and core components, the framework provides organizations with a structured and adaptable methodology for identifying, assessing, mitigating, and monitoring risks throughout the AI lifecycle.
By leveraging the principles and guidelines outlined in the framework, organizations can enhance the security, reliability, and ethical use of AI systems while ensuring compliance with regulatory requirements and stakeholder expectations. However, it's essential to recognize that effectively managing AI-related risks requires ongoing diligence, collaboration, and adaptation to evolving technological and regulatory landscapes. By embracing the NIST AI RMF as a guiding framework, organizations can navigate the complexities of AI risk management with confidence and responsibility, ultimately fostering trust and innovation in the responsible deployment of AI technologies.
In the first part of this series, we introduced the basics of brain-computer interfaces (BCIs) and how Java can be employed in developing BCI applications. In this second part, let's delve deeper into advanced concepts and explore a real-world example of a BCI application using NeuroSky's MindWave Mobile headset and their Java SDK. Advanced Concepts in BCI Development Motor Imagery Classification: This involves the mental rehearsal of physical actions without actual execution. Advanced machine learning algorithms like deep learning models can significantly improve classification accuracy. Event-Related Potentials (ERPs): ERPs are specific patterns in brain signals that occur in response to particular events or stimuli. Developing BCI applications that exploit ERPs requires sophisticated signal processing techniques and accurate event detection algorithms. Hybrid BCI Systems: Hybrid BCI systems combine multiple signal acquisition methods or integrate BCIs with other physiological signals (like eye tracking or electromyography). Developing such systems requires expertise in multiple signal acquisition and processing techniques, as well as efficient integration of different modalities. Real-World BCI Example Developing a Java Application With NeuroSky's MindWave Mobile NeuroSky's MindWave Mobile is an EEG headset that measures brainwave signals and provides raw EEG data. The company provides a Java-based SDK called ThinkGear Connector (TGC), enabling developers to create custom applications that can receive and process the brainwave data. Step-by-Step Guide to Developing a Basic BCI Application Using the MindWave Mobile and TGC Establish Connection: Use the TGC's API to connect your Java application with the MindWave Mobile device over Bluetooth. The TGC provides straightforward methods for establishing and managing this connection. Java ThinkGearSocket neuroSocket = new ThinkGearSocket(this); neuroSocket.start(); Acquire Data: Once connected, your application will start receiving raw EEG data from the device. This data includes information about different types of brainwaves (e.g., alpha, beta, gamma), as well as attention and meditation levels. Java public void onRawDataReceived(int rawData) { // Process raw data } Process Data: Use signal processing techniques to filter out noise and extract useful features from the raw data. The TGC provides built-in methods for some basic processing tasks, but you may need to implement additional processing depending on your application's needs. Java public void onEEGPowerReceived(EEGPower eegPower) { // Process EEG power data } Interpret Data: Determine the user's mental state or intent based on the processed data. This could involve setting threshold levels for certain values or using machine learning algorithms to classify the data. For example, a high attention level might be interpreted as the user wanting to move a cursor on the screen. Java public void onAttentionReceived(int attention) { // Interpret attention data } Perform Action: Based on the interpretation of the data, have your application perform a specific action. This could be anything from moving a cursor, controlling a game character, or adjusting the difficulty level of a task. Java if (attention > ATTENTION_THRESHOLD) { // Perform action } Improving BCI Performance With Java Optimize Signal Processing: Enhance the quality of acquired brain signals by implementing advanced signal processing techniques, such as adaptive filtering or blind source separation. 
Employ Advanced Machine Learning Algorithms: Utilize state-of-the-art machine learning models, such as deep neural networks or ensemble methods, to improve classification accuracy and reduce user training time. Libraries like DeepLearning4j or TensorFlow Java can be employed for this purpose. Personalize BCI Models: Customize BCI models for individual users by incorporating user-specific features or adapting the model parameters during operation. This can be achieved using techniques like transfer learning or online learning. Implement Efficient Real-Time Processing: Ensure that your BCI application can process brain signals and generate output commands in real time. Optimize your code, use parallel processing techniques, and leverage Java's concurrency features to achieve low-latency performance. Evaluate and Validate Your BCI Application: Thoroughly test your BCI application on a diverse group of users and under various conditions to ensure its reliability and usability. Employ standard evaluation metrics and follow best practices for BCI validation. Conclusion Advanced BCI applications require a deep understanding of brain signal acquisition, processing, and classification techniques. Java, with its extensive libraries and robust performance, is an excellent choice for implementing such applications. By exploring advanced concepts, developing real-world examples, and continuously improving BCI performance, developers can contribute significantly to this revolutionary field.
Google BigQuery is a powerful cloud-based data warehousing solution that enables users to analyze massive datasets quickly and efficiently. In Python, BigQuery DataFrames provide a Pythonic interface for interacting with BigQuery, allowing developers to leverage familiar tools and syntax for data querying and manipulation. In this comprehensive developer guide, we'll explore the usage of BigQuery DataFrames, their advantages, disadvantages, and potential performance issues. Introduction To BigQuery DataFrames BigQuery DataFrames serve as a bridge between Google BigQuery and Python, allowing seamless integration of BigQuery datasets into Python workflows. With BigQuery DataFrames, developers can use familiar libraries like Pandas to query, analyze, and manipulate BigQuery data. This Pythonic approach simplifies the development process and enhances productivity for data-driven applications. Advantages of BigQuery DataFrames Pythonic Interface: BigQuery DataFrames provide a Pythonic interface for interacting with BigQuery, enabling developers to use familiar Python syntax and libraries. Integration With Pandas: Being compatible with Pandas, BigQuery DataFrames allow developers to leverage the rich functionality of Pandas for data manipulation. Seamless Query Execution: BigQuery DataFrames handle the execution of SQL queries behind the scenes, abstracting away the complexities of query execution. Scalability: Leveraging the power of Google Cloud Platform, BigQuery DataFrames offer scalability to handle large datasets efficiently. Disadvantages of BigQuery DataFrames Limited Functionality: BigQuery DataFrames may lack certain advanced features and functionalities available in native BigQuery SQL. Data Transfer Costs: Transferring data between BigQuery and Python environments may incur data transfer costs, especially for large datasets. API Limitations: While BigQuery DataFrames provide a convenient interface, they may have limitations compared to directly using the BigQuery API for complex operations. Prerequisites Google Cloud Platform (GCP) Account: Ensure an active GCP account with BigQuery access. Python Environment: Set up a Python environment with the required libraries (pandas, pandas_gbq, and google-cloud-bigquery). Project Configuration: Configure your GCP project and authenticate your Python environment with the necessary credentials. 
Using BigQuery DataFrames
Install Required Libraries
Install the necessary libraries using pip:
pip install pandas pandas-gbq google-cloud-bigquery
Authenticate GCP Credentials
Authenticate your GCP credentials to enable interaction with BigQuery. The snippet below uses Application Default Credentials via google.auth.default():
Python
import google.auth

# Load GCP credentials (Application Default Credentials);
# also returns the default project ID, if one is configured
credentials, project_id = google.auth.default()
Querying BigQuery DataFrames
Use pandas_gbq to execute SQL queries and retrieve results as a DataFrame:
Python
import pandas_gbq

# SQL query
query = "SELECT * FROM `your_project_id.your_dataset_id.your_table_id`"

# Execute the query and retrieve the result as a DataFrame
df = pandas_gbq.read_gbq(query, project_id="your_project_id", credentials=credentials)
Writing to BigQuery
Write a DataFrame to a BigQuery table using pandas_gbq:
Python
# Write DataFrame to BigQuery
pandas_gbq.to_gbq(df, destination_table="your_project_id.your_dataset_id.your_new_table", project_id="your_project_id", if_exists="replace", credentials=credentials)
Advanced Features
SQL Parameters
Pass parameters to your SQL queries dynamically. With pandas_gbq, named query parameters are supplied through the BigQuery job configuration:
Python
query = "SELECT * FROM `your_project_id.your_dataset_id.your_table_id` WHERE column_name = @param_name"

# Named query parameters are passed via the BigQuery job configuration
query_config = {
    "query": {
        "parameterMode": "NAMED",
        "queryParameters": [
            {
                "name": "param_name",
                "parameterType": {"type": "STRING"},
                "parameterValue": {"value": "param_value"},
            }
        ],
    }
}

df = pandas_gbq.read_gbq(query, project_id="your_project_id", credentials=credentials, dialect="standard", configuration=query_config)
Schema Customization
Customize the DataFrame schema during the write operation:
Python
schema = [{"name": "column_name", "type": "INTEGER"}, {"name": "another_column", "type": "STRING"}]

pandas_gbq.to_gbq(df, destination_table="your_project_id.your_dataset_id.your_custom_table", project_id="your_project_id", if_exists="replace", credentials=credentials, table_schema=schema)
Performance Considerations
Data Volume: Performance may degrade with large datasets, especially when processing and transferring data between BigQuery and Python environments.
Query Complexity: Complex SQL queries may lead to longer execution times, impacting overall performance.
Network Latency: Network latency between the Python environment and BigQuery servers can affect query execution time, especially for remote connections.
Best Practices for Performance Optimization
Use Query Filters: Apply filters to SQL queries to reduce the amount of data transferred between BigQuery and Python.
Optimize SQL Queries: Write efficient SQL queries to minimize query execution time and reduce resource consumption.
Cache Query Results: Cache query results in BigQuery to avoid re-executing queries for repeated requests.
Conclusion
BigQuery DataFrames offer a convenient and Pythonic way to interact with Google BigQuery, providing developers with flexibility and ease of use. While they offer several advantages, developers should be aware of potential limitations and performance considerations. By following best practices and optimizing query execution, developers can harness the full potential of BigQuery DataFrames for data analysis and manipulation in Python.
Function pipelines allow seamless execution of multiple functions in a sequential manner, where the output of one function serves as the input to the next. This approach helps in breaking down complex tasks into smaller, more manageable steps, making code more modular, readable, and maintainable. Function pipelines are commonly used in functional programming paradigms to transform data through a series of operations. They promote a clean and functional style of coding, emphasizing the composition of functions to achieve desired outcomes. In this article, we will explore the fundamentals of function pipelines in Python, including how to create and use them effectively. We'll discuss techniques for defining pipelines, composing functions, and applying pipelines to real-world scenarios. Creating Function Pipelines in Python In this segment, we'll explore two instances of function pipelines. In the initial example, we'll define three functions—'add', 'multiply', and 'subtract'—each designed to execute a fundamental arithmetic operation as implied by its name. Python def add(x, y): return x + y def multiply(x, y): return x * y def subtract(x, y): return x - y Next, create a pipeline function that takes any number of functions as arguments and returns a new function. This new function applies each function in the pipeline to the input data sequentially. Python # Pipeline takes multiple functions as argument and returns an inner function def pipeline(*funcs): def inner(data): result = data # Iterate thru every function for func in funcs: result = func(result) return result return inner Let’s understand the pipeline function. The pipeline function takes any number of functions (*funcs) as arguments and returns a new function (inner). The inner function accepts a single argument (data) representing the input data to be processed by the function pipeline. Inside the inner function, a loop iterates over each function in the funcs list. For each function func in the funcs list, the inner function applies func to the result variable, which initially holds the input data. The result of each function call becomes the new value of result. After all functions in the pipeline have been applied to the input data, the inner function returns the final result. Next, we create a function called ‘calculation_pipeline’ that passes the ‘add’, ‘multiply’ and ‘substract’ to the pipeline function. Python # Create function pipeline calculation_pipeline = pipeline( lambda x: add(x, 5), lambda x: multiply(x, 2), lambda x: subtract(x, 10) ) Then we can test the function pipeline by passing an input value through the pipeline. Python result = calculation_pipeline(10) print(result) # Output: 20 We can visualize the concept of a function pipeline through a simple diagram. 
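As a side note, the same composition can be written with the standard library's functools.reduce instead of an explicit loop. The following is a minimal sketch (not part of the original example) that reproduces calculation_pipeline and yields the same result:
Python
from functools import reduce

def add(x, y):
    return x + y

def multiply(x, y):
    return x * y

def subtract(x, y):
    return x - y

def pipeline(*funcs):
    # reduce threads the running result through each function, left to right
    def inner(data):
        return reduce(lambda result, func: func(result), funcs, data)
    return inner

calculation_pipeline = pipeline(
    lambda x: add(x, 5),
    lambda x: multiply(x, 2),
    lambda x: subtract(x, 10),
)

print(calculation_pipeline(10))  # Output: 20
Whether to use the explicit loop or reduce is purely a style choice; both versions behave identically.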
Another example: Python def validate(text): if text is None or not text.strip(): print("String is null or empty") else: return text def remove_special_chars(text): for char in "!@#$%^&*()_+{}[]|\":;'<>?,./": text = text.replace(char, "") return text def capitalize_string(text): return text.upper() # Pipeline takes multiple functions as argument and returns an inner function def pipeline(*funcs): def inner(data): result = data # Iterate thru every function for func in funcs: result = func(result) return result return inner # Create function pipeline str_pipeline = pipeline( lambda x : validate(x), lambda x: remove_special_chars(x), lambda x: capitalize_string(x) ) Testing the pipeline by passing the correct input: Python # Test the function pipeline result = str_pipeline("Test@!!!%#Abcd") print(result) # TESTABCD In case of an empty or null string: Python result = str_pipeline("") print(result) # Error In the example, we've established a pipeline that begins by validating the input to ensure it's not empty. If the input passes this validation, it proceeds to the 'remove_special_chars' function, followed by the 'Capitalize' function. Benefits of Creating Function Pipelines Function pipelines encourage modular code design by breaking down complex tasks into smaller, composable functions. Each function in the pipeline focuses on a specific operation, making it easier to understand and modify the code. By chaining together functions in a sequential manner, function pipelines promote clean and readable code, making it easier for other developers to understand the logic and intent behind the data processing workflow. Function pipelines are flexible and adaptable, allowing developers to easily modify or extend existing pipelines to accommodate changing requirements.
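To make the flexibility point above concrete: because pipeline returns an ordinary function, an existing pipeline can itself be used as a stage in a larger one. Here is a small sketch that builds on the str_pipeline example above (the extra whitespace-trimming step is hypothetical):
Python
# pipeline and str_pipeline are the definitions from the example above
extended_pipeline = pipeline(
    str_pipeline,          # validate -> remove special characters -> capitalize
    lambda s: s.strip(),   # hypothetical extra step: trim surrounding whitespace
)

result = extended_pipeline("  Test@!!!%#Abcd  ")
print(result)  # TESTABCD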
For the next 20 days (don’t ask me why I chose that number), I will be publishing a DynamoDB quick tip per day with code snippets. The examples use the DynamoDB packages from AWS SDK for Go V2 but should be applicable to other languages as well.
Day 20: Converting Between Go and DynamoDB Types Posted: 13/Feb/2024 The DynamoDB attributevalue package in the AWS SDK for Go can save you a lot of time, thanks to the Marshal and Unmarshal family of utility functions that can be used to convert between Go types (including structs) and AttributeValues. Here is an example using a Go struct: MarshalMap converts the Customer struct into the map[string]types.AttributeValue required by PutItem, and UnmarshalMap converts the map[string]types.AttributeValue returned by GetItem back into a Customer struct. Go type Customer struct { Email string `dynamodbav:"email"` Age int `dynamodbav:"age,omitempty"` City string `dynamodbav:"city"` } customer := Customer{Email: "abhirockzz@gmail.com", City: "New Delhi"} item, _ := attributevalue.MarshalMap(customer) client.PutItem(context.Background(), &dynamodb.PutItemInput{ TableName: aws.String(tableName), Item: item, }) resp, _ := client.GetItem(context.Background(), &dynamodb.GetItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{"email": &types.AttributeValueMemberS{Value: "abhirockzz@gmail.com"}}, }) var cust Customer attributevalue.UnmarshalMap(resp.Item, &cust) log.Println("item info:", cust.Email, cust.City) Recommended reading: MarshalMap API doc, UnmarshalMap API doc, AttributeValue API doc
Day 19: PartiQL Batch Operations Posted: 12/Feb/2024 You can use batched operations with PartiQL as well, thanks to BatchExecuteStatement. It allows you to batch reads as well as write requests. Here is an example (note that you cannot mix reads and writes in a single batch): Go //read statements client.BatchExecuteStatement(context.Background(), &dynamodb.BatchExecuteStatementInput{ Statements: []types.BatchStatementRequest{ { Statement: aws.String("SELECT * FROM url_metadata where shortcode=?"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "abcd1234"}, }, }, { Statement: aws.String("SELECT * FROM url_metadata where shortcode=?"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "qwer4321"}, }, }, }, }) //separate batch for write statements client.BatchExecuteStatement(context.Background(), &dynamodb.BatchExecuteStatementInput{ Statements: []types.BatchStatementRequest{ { Statement: aws.String("INSERT INTO url_metadata value {'longurl':?,'shortcode':?, 'active': true}"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "https://github.com/abhirockzz"}, &types.AttributeValueMemberS{Value: uuid.New().String()[:8]}, }, }, { Statement: aws.String("UPDATE url_metadata SET active=? where shortcode=?"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberBOOL{Value: false}, &types.AttributeValueMemberS{Value: "abcd1234"}, }, }, { Statement: aws.String("DELETE FROM url_metadata where shortcode=?"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "qwer4321"}, }, }, }, }) Just like BatchWriteItem, BatchExecuteStatement is limited to 25 statements (operations) per batch. Recommended reading: BatchExecuteStatement API docs, Build faster with Amazon DynamoDB and PartiQL: SQL-compatible operations (thanks, Pete Naylor!)
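One detail worth noting about BatchExecuteStatement: a failure of an individual statement does not fail the whole call; each statement reports its own outcome in the response. Below is a rough sketch of inspecting those per-statement results, assuming the read batch above is captured as resp, err := client.BatchExecuteStatement(...) and using the Responses, Error, and Item fields described in the aws-sdk-go-v2 API docs:
Go
if err != nil {
    log.Fatal(err) // the request itself failed (for example, validation or auth)
}
for i, r := range resp.Responses {
    if r.Error != nil {
        // this particular statement failed; the other statements are unaffected
        log.Println("statement", i, "failed:", r.Error.Code, aws.ToString(r.Error.Message))
        continue
    }
    // for SELECT statements, the matched item (if any) is returned in r.Item
    log.Println("statement", i, "item:", r.Item)
}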
Day 18: Using a SQL-Compatible Query Language Posted: 6/Feb/2024 DynamoDB supports PartiQL to execute SQL-like select, insert, update, and delete operations. Here is an example of how you would use PartiQL-based queries for a simple URL shortener application. Notice how it uses a (generic) ExecuteStatement API to execute INSERT, SELECT, UPDATE and DELETE: Go _, err := client.ExecuteStatement(context.Background(), &dynamodb.ExecuteStatementInput{ Statement: aws.String("INSERT INTO url_metadata value {'longurl':?,'shortcode':?, 'active': true}"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "https://github.com/abhirockzz"}, &types.AttributeValueMemberS{Value: uuid.New().String()[:8]}, }, }) _, err := client.ExecuteStatement(context.Background(), &dynamodb.ExecuteStatementInput{ Statement: aws.String("SELECT * FROM url_metadata where shortcode=? AND active=true"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "abcd1234"}, }, }) _, err := client.ExecuteStatement(context.Background(), &dynamodb.ExecuteStatementInput{ Statement: aws.String("UPDATE url_metadata SET active=? where shortcode=?"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberBOOL{Value: false}, &types.AttributeValueMemberS{Value: "abcd1234"}, }, }) _, err := client.ExecuteStatement(context.Background(), &dynamodb.ExecuteStatementInput{ Statement: aws.String("DELETE FROM url_metadata where shortcode=?"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "abcd1234"}, }, }) Recommended reading: Amazon DynamoDB documentation on PartiQL support, ExecuteStatement API docs
Day 17: BatchGetItem Operation Posted: 5/Feb/2024 You can club multiple (up to 100) GetItem requests in a single BatchGetItem operation - this can be done across multiple tables. Here is an example that includes four GetItem calls across two different tables: Go resp, err := client.BatchGetItem(context.Background(), &dynamodb.BatchGetItemInput{ RequestItems: map[string]types.KeysAndAttributes{ "customer": types.KeysAndAttributes{ Keys: []map[string]types.AttributeValue{ { "email": &types.AttributeValueMemberS{Value: "c1@foo.com"}, }, { "email": &types.AttributeValueMemberS{Value: "c2@foo.com"}, }, }, }, "Thread": types.KeysAndAttributes{ Keys: []map[string]types.AttributeValue{ { "ForumName": &types.AttributeValueMemberS{Value: "Amazon DynamoDB"}, "Subject": &types.AttributeValueMemberS{Value: "DynamoDB Thread 1"}, }, { "ForumName": &types.AttributeValueMemberS{Value: "Amazon S3"}, "Subject": &types.AttributeValueMemberS{Value: "S3 Thread 1"}, }, }, ProjectionExpression: aws.String("Message"), }, }, ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, }) Just like an individual GetItem call, you can include Projection Expressions and return RCUs. Note that BatchGetItem can only retrieve up to 16 MB of data. Recommended reading: BatchGetItem API doc
Day 16: Enhancing Write Performance With Batching Posted: 2/Feb/2024 The DynamoDB BatchWriteItem operation can provide a performance boost by allowing you to squeeze in up to 25 individual PutItem and DeleteItem requests in a single API call - this can be done across multiple tables.
Here is an example that combines PutItem and DeleteItem operations for two different tables (customer, orders): Go _, err := client.BatchWriteItem(context.Background(), &dynamodb.BatchWriteItemInput{ RequestItems: map[string][]types.WriteRequest{ "customer": []types.WriteRequest{ { PutRequest: &types.PutRequest{ Item: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: "c3@foo.com"}, }, }, }, { DeleteRequest: &types.DeleteRequest{ Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: "c1@foo.com"}, }, }, }, }, "orders": []types.WriteRequest{ { PutRequest: &types.PutRequest{ Item: map[string]types.AttributeValue{ "order_id": &types.AttributeValueMemberS{Value: "oid_1234"}, }, }, }, { DeleteRequest: &types.DeleteRequest{ Key: map[string]types.AttributeValue{ "order_id": &types.AttributeValueMemberS{Value: "oid_4321"}, }, }, }, }, }, }) Be aware of the following constraints: the total request size cannot exceed 16 MB, and BatchWriteItem cannot update items. Recommended reading: BatchWriteItem API doc
Day 15: Using the DynamoDB Expression Package To Build Update Expressions Posted: 31/Jan/2024 The DynamoDB Go SDK's expression package supports the programmatic creation of update expressions. Here is an example of how you can build an expression to execute a SET operation of the UpdateItem API and combine it with a condition expression (update criteria): Go updateExpressionBuilder := expression.Set(expression.Name("category"), expression.Value("standard")) conditionExpressionBuilder := expression.AttributeNotExists(expression.Name("account_locked")) expr, _ := expression.NewBuilder(). WithUpdate(updateExpressionBuilder). WithCondition(conditionExpressionBuilder). Build() resp, err := client.UpdateItem(context.Background(), &dynamodb.UpdateItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: "c1@foo.com"}, }, UpdateExpression: expr.Update(), ConditionExpression: expr.Condition(), ExpressionAttributeNames: expr.Names(), ExpressionAttributeValues: expr.Values(), ReturnValues: types.ReturnValueAllOld, }) Recommended reading: WithUpdate method in the package API docs.
Day 14: Using the DynamoDB Expression Package To Build Key Condition and Filter Expressions Posted: 30/Jan/2024 You can use the expression package in the AWS Go SDK for DynamoDB to programmatically build key condition and filter expressions and use them with the Query API. Here is an example: Go keyConditionBuilder := expression.Key("ForumName").Equal(expression.Value("Amazon DynamoDB")) filterExpressionBuilder := expression.Name("Views").GreaterThanEqual(expression.Value(3)) expr, _ := expression.NewBuilder(). WithKeyCondition(keyConditionBuilder). WithFilter(filterExpressionBuilder). Build() _, err := client.Query(context.Background(), &dynamodb.QueryInput{ TableName: aws.String("Thread"), KeyConditionExpression: expr.KeyCondition(), FilterExpression: expr.Filter(), ExpressionAttributeNames: expr.Names(), ExpressionAttributeValues: expr.Values(), }) Recommended reading: Key and NameBuilder in the package API docs
Day 13: Using the DynamoDB Expression Package To Build Condition Expressions Posted: 25/Jan/2024 Thanks to the expression package in the AWS Go SDK for DynamoDB, you can programmatically build Condition expressions and use them with write operations.
Here is an example of the DeleteItem API: Go conditionExpressionBuilder := expression.Name("inactive_days").GreaterThanEqual(expression.Value(20)) conditionExpression, _ := expression.NewBuilder().WithCondition(conditionExpressionBuilder).Build() _, err := client.DeleteItem(context.Background(), &dynamodb.DeleteItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, ConditionExpression: conditionExpression.Condition(), ExpressionAttributeNames: conditionExpression.Names(), ExpressionAttributeValues: conditionExpression.Values(), }) Recommended reading: WithCondition method in the package API docs Day 12: Using the DynamoDB Expression Package To Build Projection Expressions Posted: 24/Jan/2024 The expression package in the AWS Go SDK for DynamoDB provides a fluent builder API with types and functions to create expression strings programmatically along with corresponding expression attribute names and values. Here is an example of how you would build a Projection Expression and use it with the GetItem API: Go projectionBuilder := expression.NamesList(expression.Name("first_name"), expression.Name("last_name")) projectionExpression, _ := expression.NewBuilder().WithProjection(projectionBuilder).Build() _, err := client.GetItem(context.Background(), &dynamodb.GetItemInput{ TableName: aws.String("customer"), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: "c1@foo.com"}, }, ProjectionExpression: projectionExpression.Projection(), ExpressionAttributeNames: projectionExpression.Names(), }) Recommended reading: expression package API docs. Day 11: Using Pagination With Query API Posted: 22/Jan/2024 The Query API returns the result set size to 1 MB. Use ExclusiveStartKey and LastEvaluatedKey elements to paginate over large result sets. You can also reduce page size by limiting the number of items in the result set with the Limit parameter of the Query operation. Go func paginatedQuery(searchCriteria string, pageSize int32) { currPage := 1 var exclusiveStartKey map[string]types.AttributeValue for { resp, _ := client.Query(context.Background(), &dynamodb.QueryInput{ TableName: aws.String(tableName), KeyConditionExpression: aws.String("ForumName = :name"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":name": &types.AttributeValueMemberS{Value: searchCriteria}, }, Limit: aws.Int32(pageSize), ExclusiveStartKey: exclusiveStartKey, }) if resp.LastEvaluatedKey == nil { return } currPage++ exclusiveStartKey = resp.LastEvaluatedKey } } Recommended reading: Query Pagination Day 10: Query API With Filter Expression Posted: 19/Jan/2024 With the DynamoDB Query API, you can use Filter Expressions to discard specific query results based on criteria. Note that the filter expression is applied after a Query finishes but before the results are returned. Thus, it has no impact on the RCUs (read capacity units) consumed by the query. 
Here is an example that filters out forum discussion threads that have less than a specific number of views: Go resp, err := client.Query(context.Background(), &dynamodb.QueryInput{ TableName: aws.String(tableName), KeyConditionExpression: aws.String("ForumName = :name"), FilterExpression: aws.String("#v >= :num"), ExpressionAttributeNames: map[string]string{ "#v": "Views", }, ExpressionAttributeValues: map[string]types.AttributeValue{ ":name": &types.AttributeValueMemberS{Value: forumName}, ":num": &types.AttributeValueMemberN{Value: numViews}, }, }) Recommended reading: Filter Expressions Day 9: Query API Posted: 18/Jan/2024 The Query API is used to model one-to-many relationships in DynamoDB. You can search for items based on (composite) primary key values using Key Condition Expressions. The value for the partition key attribute is mandatory - the query returns all items with that partition key value. Additionally, you can also provide a sort key attribute and use a comparison operator to refine the search results. With the Query API, you can also: Switch to strongly consistent read (eventual consistent being the default) Use a projection expression to return only some attributes Return the consumed Read Capacity Units (RCU) Here is an example that queries for a specific thread based on the forum name (partition key) and subject (sort key). It only returns the Message attribute: Go resp, err = client.Query(context.Background(), &dynamodb.QueryInput{ TableName: aws.String(tableName), KeyConditionExpression: aws.String("ForumName = :name and Subject = :sub"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":name": &types.AttributeValueMemberS{Value: forumName}, ":sub": &types.AttributeValueMemberS{Value: subject}, }, ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, ConsistentRead: aws.Bool(true), ProjectionExpression: aws.String("Message"), }) Recommended reading: API Documentation Item Collections Key Condition Expressions Composite primary key Day 8: Conditional Delete Operation Posted: 17/Jan/2024 All the DynamoDB write APIs, including DeleteItem support criteria-based (conditional) execution. You can use DeleteItem operation with a condition expression — it must be evaluated to true in order for the operation to succeed. Here is an example that verifies the value of inactive_days attribute: Go resp, err := client.DeleteItem(context.Background(), &dynamodb.DeleteItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, ConditionExpression: aws.String("inactive_days >= :val"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":val": &types.AttributeValueMemberN{Value: "20"}, }, }) if err != nil { if strings.Contains(err.Error(), "ConditionalCheckFailedException") { return } else { log.Fatal(err) } } Recommended reading: Conditional deletes documentation Day 7: DeleteItem API Posted: 16/Jan/2024 The DynamoDB DeleteItem API does what it says - delete an item. 
But it can also: Return the content of the old item (at no additional cost) Return the consumed Write Capacity Units (WCU) Return the item attributes for an operation that failed a condition check (again, no additional cost) Retrieve statistics about item collections, if any, that were affected during the operation Here is an example: Go resp, err := client.DeleteItem(context.Background(), &dynamodb.DeleteItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, ReturnValues: types.ReturnValueAllOld, ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, ReturnValuesOnConditionCheckFailure: types.ReturnValuesOnConditionCheckFailureAllOld, ReturnItemCollectionMetrics: types.ReturnItemCollectionMetricsSize, }) Recommended reading: DeleteItem API doc Day 6: Atomic Counters With UpdateItem Posted: 15/Jan/2024 Need to implement an atomic counter using DynamoDB? If you have a use case that can tolerate over-counting or under-counting (for example, visitor count), use the UpdateItem API. Here is an example that uses the SET operator in an update expression to increment num_logins attribute: Go resp, err := client.UpdateItem(context.Background(), &dynamodb.UpdateItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, UpdateExpression: aws.String("SET num_logins = num_logins + :num"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":num": &types.AttributeValueMemberN{ Value: num, }, }, ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, }) Note that every invocation of UpdateItem will increment (or decrement) — hence, it is not idempotent. Recommended reading: Atomic Counters Day 5: Avoid Overwrites When Using DynamoDB UpdateItem API Posted: 12/Jan/2024 The UpdateItem API creates a new item or modifies an existing item's attributes. If you want to avoid overwriting an existing attribute, make sure to use the SET operation with if_not_exists function. Here is an example that sets the category of an item only if the item does not already have a category attribute: Go resp, err := client.UpdateItem(context.Background(), &dynamodb.UpdateItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, UpdateExpression: aws.String("SET category = if_not_exists(category, :category)"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":category": &types.AttributeValueMemberS{ Value: category, }, }, }) Note that if_not_exists function can only be used in the SET action of an update expression. Recommended reading: DynamoDB documentation Day 4: Conditional UpdateItem Posted: 11/Jan/2024 Conditional operations are helpful in cases when you want a DynamoDB write operation (PutItem, UpdateItem or DeleteItem) to be executed based on certain criteria. To do so, use a condition expression - it must evaluate to true in order for the operation to succeed. Here is an example that demonstrates a conditional UpdateItem operation. 
It uses the attribute_not_exists function: Go resp, err := client.UpdateItem(context.Background(), &dynamodb.UpdateItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, UpdateExpression: aws.String("SET first_name = :fn"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":fn": &types.AttributeValueMemberS{ Value: firstName, }, }, ConditionExpression: aws.String("attribute_not_exists(account_locked)"), ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, }) Recommended reading: ConditionExpressions Day 3: UpdateItem Add-On Benefits Posted: 10/Jan/2024 The DynamoDB UpdateItem operation is quite flexible. In addition to using many types of operations, you can: Use multiple update expressions in a single statement Get the item attributes as they appear before or after they are successfully updated Understand which item attributes failed the condition check (no additional cost) Retrieve the consumed Write Capacity Units (WCU) Here is an example (using AWS Go SDK v2): Go resp, err = client.UpdateItem(context.Background(), &dynamodb.UpdateItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, UpdateExpression: aws.String("SET last_name = :ln REMOVE category"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":ln": &types.AttributeValueMemberS{ Value: lastName, }, }, ReturnValues: types.ReturnValueAllOld, ReturnValuesOnConditionCheckFailure: types.ReturnValuesOnConditionCheckFailureAllOld, ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, } Recommended reading: UpdateItem API Update Expressions Day 2: GetItem Add-On Benefits Posted: 9/Jan/2024 Did you know that the DynamoDB GetItem operation also gives you the ability to: Switch to strongly consistent read (eventually consistent being the default) Use a projection expression to return only some of the attributes Return the consumed Read Capacity Units (RCU) Here is an example (DynamoDB Go SDK): Go resp, err := client.GetItem(context.Background(), &dynamodb.GetItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ //email - partition key "email": &types.AttributeValueMemberS{Value: email}, }, ConsistentRead: aws.Bool(true), ProjectionExpression: aws.String("first_name, last_name"), ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, }) Recommended reading: GetItem API doc link Projection expressions Day 1: Conditional PutItem Posted: 8/Jan/2024 The DynamoDB PutItem API overwrites the item in case an item with the same primary key already exists. To avoid (or work around) this behavior, use PutItem with an additional condition. 
Here is an example that uses the attribute_not_exists function: Go _, err := client.PutItem(context.Background(), &dynamodb.PutItemInput{ TableName: aws.String(tableName), Item: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, ConditionExpression: aws.String("attribute_not_exists(email)"), ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, ReturnValues: types.ReturnValueAllOld, ReturnItemCollectionMetrics: types.ReturnItemCollectionMetricsSize, }) if err != nil { if strings.Contains(err.Error(), "ConditionalCheckFailedException") { log.Println("failed pre-condition check") return } else { log.Fatal(err) } } With the PutItem operation, you can also: Return the consumed Write Capacity Units (WCU) Get the item attributes as they appeared before (in case they were updated during the operation) Retrieve statistics about item collections, if any, that were modified during the operation Recommended reading: API Documentation Condition Expressions Comparison Functions
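A closing note on the error handling pattern used in these snippets: instead of matching on the error string with strings.Contains, the typed error exposed by the SDK can be checked directly with errors.As. Here is a small alternative sketch, assuming the same aws-sdk-go-v2 dynamodb types package used in the examples above and the standard library errors package:
Go
// after a conditional PutItem, UpdateItem, or DeleteItem call that returned err
var ccf *types.ConditionalCheckFailedException
if errors.As(err, &ccf) {
    // the condition expression evaluated to false; not a transport or server-side failure
    log.Println("failed pre-condition check")
    return
}
if err != nil {
    log.Fatal(err)
}
This keeps working even if the SDK changes how it formats error messages.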