How Standard Cognition Builds AI-powered Autonomous Checkout on Computer Vision Data
Standard Cognition develops AI-powered checkout experiences for brick-and-mortar retailers with its computer vision platform. In this talk, Standard Cognition’s Tushar Dadlani discusses the challenges of delivering a no-touch checkout experience that responds to constantly changing data in real time.
At Standard Cognition, streaming vision data is converted into high-velocity, high-variety metadata that requires purpose-built databases so developers can prototype on the data and build AI models quickly. The team faced several challenges in building applications on vision data and ultimately moved away from traditional databases and towards a real-time database in the cloud for the following reasons:
- Velocity of vision data: Standard employed three data strategies to handle the velocity of vision data: keeping high-frequency (~500 Hz) data on the store premises, immediately processing low-frequency (~5 Hz) data in the cloud, and streaming medium-frequency (~50 Hz) data for application development.
- Variety of vision data: The schema of the metadata changes on a daily basis, so the team selected databases that could handle frequent schema changes, mixed data types, and complex objects.
- Developer friendly: Developers needed to rapidly iterate and build data-powered applications on production datasets, so Standard selected databases that made the data easily available via REST APIs.
Tushar shares a data stack that handles data at different frequencies and from different sources, and is easily accessible to developers for ad-hoc analysis, prototyping, and moving new features into production. Standard uses Rockset, a real-time database in the cloud, for speed and simplicity when developing new features and verifying the accuracy of their AI models.
Speakers
Show Notes
Dhruba Borthakur:
Welcome to today's webinar. I think we have a few people online now, but maybe we'll wait for another minute before we get started. Is that okay with you, Tushar?
Tushar Dadlani:
Yep. That makes sense.
Dhruba Borthakur:
Okay. We'll wait till around maybe 11:01 or 11:02, and then get started on it. Oh, in the meantime, I wanted to point out that people who've already joined, there's a Slack channel that is there. Please join the Slack channel because we can have a Q&A or AMA right after the presentation. Or even when the presentation is going on, you could ask us questions and I'll be very happy to answer them. (silence).
Dhruba Borthakur:
Okay. Let's get started. We have quite a few people on the phone, listening to our webinar. Welcome to today's webinar. I'm very glad to be hosting Tushar. Today we are going to talk about vision and AI systems, and data systems. Our guest, as you all know, is Tushar. He's at Standard Cognition. And Standard Cognition is a cutting edge company that is innovating in the field of AI powered, autonomous checkout systems. We're very excited to hear what Tushar is going to tell us today. I have known Tushar for a long time now, maybe four or five years back. At that time he was building his own startup called Explorer.ai, and he was looking around for data systems to store vision data or video data for fast processing. That was almost maybe four years back. His startup later got acquired by Standard Cognition, and he continued to use Rockset at his new workplace as well.
Dhruba Borthakur:
So today Tushar is going to tell us his story of his association with vision and AI, and data systems, and give us a general landscape of what is the state of affairs for vision and AI, as far as data is concerned. As far as logistics is concerned, please do send us questions on the webinar chat window, or maybe join the Slack channel. We can answer your questions when the talk is going on, or we will also have a Q&A after the presentation. Also, please feel free to ask questions related to Tushar's talk, or AI or vision in general, or also about Rockset tech, or any other Rockset details that you'd like to know. We'll be very happy to chat with you. Welcome, Tushar. Really great to have you here. The floor is all yours.
Tushar Dadlani:
All right. Good morning, good evening, good afternoon, wherever you're joining us from. I really wanted to thank Dhruba for this opportunity to share a little bit about how computer vision and the status quo of production computer vision systems have been evolving over the past couple of years. Just to give a little bit of background about myself and Standard Cognition, I started working in the industry on low-level systems, as low as boot loaders and stuff like that. Then I slowly evolved into working in the cloud space. More recently, I've been focused on computer vision and AI. So I think one thing which I really appreciated about some of that prior experience and working in AI is how things can go wrong. And anything that can go wrong will go wrong.
Tushar Dadlani:
To tell you a little bit about what Standard Cognition is, Standard Cognition is an autonomous checkout company. So essentially, you walk into a store, pick up whatever you want, and you leave. We automatically charge you for whatever you picked up. Especially given the current global crisis, it's all the more important that you don't have to interact with and touch more surfaces than you need to. You just want to touch the minimum number of surfaces that are required for your shopping experience, and not touch way more. That's how I look at some of this technology evolving. To give you an overview of what we'll be talking about today: we'll talk about how computer vision technology has evolved over the past couple of decades and some of the challenges in the computer vision space. Then I'll go into a little bit of detail about what the data stack at Standard Cognition looks like, with some of the vision and vision metadata. And then I'll talk a little bit about Rockset and do a quick demo. Then we should have ample time for Q&A after that.
Tushar Dadlani:
The evolution of computer vision technology, I think it started with the invention of the camera, right? If you don't have a camera, you can't recreate human vision. But furthermore, for people from different backgrounds, I think an important element of what computer vision is ... The way Wikipedia describes it, it's an interdisciplinary scientific field that deals with how computers gain a high-level understanding from digital images or videos. Right? So I think that is a very clear definition, where it says it's an interdisciplinary scientific field. But I think one thing which usually gets missed out: if you go 20 years back, computer vision was focused on understanding geometry, understanding optics, and things like that. But more recently, you hear about deep learning and what deep learning can do to create vision systems that represent how our human eyes work. But I think an important element that often gets missed out in all of that is how complex human vision is. I'll go into a little bit more detail about that later in the presentation.
Tushar Dadlani:
I think one important element that changed the amount of digital image data being produced is the digital camera. I think the digital camera, with the development of the CMOS sensor, was really a revolutionary point where we were able to generate more and more digital data than ever before. Right? I think that led to a lot of availability of image data, right? As any deep learning researcher knows, data is the key, and you need a lot of data to actually build a supervised model, and I think the sensor really enabled some of that. The other dimension of innovation, which has been evolving all the way since the time of Galileo, is lenses, right? That's just fundamental optics and how light behaves, and all the sensor is doing is capturing light in different shapes and forms.
Tushar Dadlani:
Last but not least, one thing which often gets overlooked is that SD cards, or just storage media, have become so relevant and so prevalent in today's age that storing some of this imagery and data that has been captured, and making it available to a broader audience, has been one of the revolutionary components of the digital camera that has allowed for innovation in the area of computer vision. To give you a little background about the evolution of computer vision technology: I think the first mass application of some of this computer vision technology, for the world at large, was having an iPhone in your pocket, right? So somebody who's not a professional photographer would be able to take really good pictures. I know a lot of photographers don't like the idea that people just use an iPhone with default settings and get really good pictures. But I think photography is more than just the quality of pictures and the way you take a picture.
Tushar Dadlani:
But I think what the iPhone enabled was that anybody could take a really good picture, right? One that looks professional, right? Even though photography has other elements. So I think leveraging the compute on an iPhone, with a really powerful lens and a really smart form factor for the sensor, really enabled the iPhone, and is a large part of the iPhone's success, even today. The data that you generate from your phone is just limited to your phone. Maybe you will share it with your friends and family. You will email it to some people. But I think what ended up happening was, when the cloud started emerging, technologies like Facebook Photos and Instagram really enabled us to collect more and more data about images, and share that information with our friends and family. Right? Maybe Dhruba can talk a little bit about how the data at Instagram grew, because he was at Facebook for a long time.
Dhruba Borthakur:
Yeah, that's true. Funny that you mention this. I used to be an engineer at Facebook, and I've built a lot of backend systems. One of the systems was definitely something that was storing a lot of photos. This was many years back, I think 2013 or '14, something around there. And the cloud really made it easy for people to store so many photos. So Facebook, at that time, we used to get 5,000 or 6,000 photo uploads per second. So you can imagine the amount of distributed computing that's needed just to store these datasets. Like you said, the cloud software really made it easy to store a lot of these photos and videos, but it's, I think, only now where more neural networks or AI systems are actually processing these videos, and making intelligence out of it. So maybe you can talk more about, or explain to us how the next phase of AI and vision is progressing these days.
Tushar Dadlani:
I think that was a really good introduction to what data has been generated. Like Dhruba was mentioning, 5,000 photos per second were being uploaded to Facebook's photo sharing service. Right? And I think if you start thinking about the current, state-of-the-art computer vision products, you have tens or hundreds of cameras collecting data, anywhere between 30 to 60 frames per second, per camera. If you start doing the math, you're like, "Wow, 5,000 images per second is nothing." From just one small system, you're generating 5,000 images per second. And then you multiply it by hundreds and thousands, and millions. If you apply it to the Facebook scale, the math becomes almost too much to do in your head. So I think the technology enablers that have made computer vision products possible have been along four axes. One of them has been hardware, right? So if you think about hardware, there is the camera itself, and the form factor of the camera, the form factor of the sensor, things like that.
Tushar Dadlani:
Then cloud data storage, right? I think cloud data storage has never been cheaper than right now. I think the total cost of storing images and photos is really cheap. So you can just keep storing an infinite amount of data in the cloud. And then faster networks, right? I think the ability to store and to upload some of this information, right? So most of the Internet, as we know it today, has been designed towards downloads. Everybody wants to be able to download their data rapidly, like the average consumer. But I think uploads are one of those key things that infrastructure has not necessarily been optimized too much for. But I think if you look at the amount of data being generated, you really want to figure out how to upload some of this data really quickly. And then last but not least, the deep learning algorithms themselves, right?
Tushar Dadlani:
I think convolutional neural networks really opened the way for building different kinds of computer vision models that take deep learning approaches towards more and more complex tasks, like understanding where people are in an image, and what exactly the pixel boundaries of those people are. Things like that. So I think with these four elements, you are really able to see production computer vision applications evolve. But having said all of that, I think an important thing that usually ... It's still pretty hard, right? I think what I was trying to point out is you need innovation in so many diverse fields of technology to enable innovation in computer vision, right? And as a computer vision company, you usually are a full stack company where you're thinking about everything from things like storage and cameras, all the way up to AI algorithms, and how to productionize those algorithms.
Tushar Dadlani:
I just want to quickly share a video, right? This is from Standard Cognition, so it's one of our videos. I think an important element that is really interesting about cameras and deep neural networks is you can take all these complex feeds of image data, and start generating information as to what exactly different people are picking up, what they are shopping for, what is in their shopping cart. You might say that, okay, it's almost a virtual shopping cart, like the Amazon.com shopping cart. But I think the big difference here is you can start seeing that the edge cases are way too many, right? So things like, I picked up an item, I gave it to somebody else. I used both my hands to grab an item. Even after all of this, right? And since deep neural networks emit probabilities, they don't emit specifics on what exactly the item is. There is a lot of work to get deterministic behavior from this very complex set of neural networks that work in tandem with each other. Right?
Tushar Dadlani:
There's a lot of different areas where I think you can start seeing that the number of edge cases in a production AI system just keeps growing very rapidly. And you start needing to think about how you're productionizing these things, and not just like, okay, this is a great POC, but I don't know how to take it and productionize it. Right? So I think that when cameras meet deep neural networks, you're not just seeing what is happening on the screen, but you are estimating: what is the data volume? How quickly is this data being generated? What is the quality of the outcome I want to see? Things like that. So it's not just one axis of, okay, the system is really smart. There is tons of stuff happening behind the scenes that is almost invisible to you in this video.
Tushar Dadlani:
I just wanted to talk a little bit about computer vision algorithms, right? Computer vision algorithms started with a simple classification task like, is this a dog? Is this a cat? Is this a dog? Is this a cat? Google did some research in that area and built classifiers for "is this a dog? Is this a cat?" kinds of questions. Right? Moving on, I think in the more recent past, there have been a large number of systems which do things like object detection and object tracking. Object detection is essentially just drawing a bounding box around a group of pixels, which means that, okay, if there's a person, they're at these X, Y coordinates in an image, and this is a rectangle around that person. Object tracking has been mostly like, okay, this person is here, in this specific frame or image. And where is this person moving in the 2D space of the image, right? And then segmentation is almost pixel-level accuracy: what are the exact bounds of the person at a pixel level of detail?
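For illustration, here is a minimal sketch of what the outputs of these task families might look like. This is not Standard Cognition's actual data format; every field name and value here is hypothetical.

```python
# Illustrative sketch only: hypothetical output shapes for classification,
# detection, tracking, and segmentation. Field names and values are made up.
import numpy as np

# Classification: a single label with a probability ("is this a dog or a cat?").
classification = {"label": "dog", "score": 0.97}

# Object detection: a bounding box (x, y, width, height in pixels) per object.
detection = {
    "frame_id": 1042,
    "objects": [
        {"label": "person", "bbox": [412, 118, 96, 310], "score": 0.88},
    ],
}

# Object tracking: the same person linked across frames by a track id,
# i.e. where that box moves in the 2D space of the image over time.
track = {
    "track_id": "person-7",
    "positions": [  # (frame_id, x, y) centroids over time
        (1042, 460, 273),
        (1043, 462, 275),
    ],
}

# Segmentation: a per-pixel mask giving the exact bounds of the person.
height, width = 720, 1280
segmentation_mask = np.zeros((height, width), dtype=bool)
segmentation_mask[118:428, 412:508] = True  # pixels belonging to the person
```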
Tushar Dadlani:
Each of these have different use cases, right? I think when you start getting into more complex understanding of the 3D environment around us, where computer vision is really useful, is you start seeing that, okay, segmentation starts becoming a very important task. But even though a lot of these algorithms have developed over the last couple of years, one of the things that are really challenging is to get them running in near real time. Right? Some of these systems are very hard to even get running at a certain frame rate, right? Tesla has done a bunch of talks about this, on how complex these neural networks and their impacts on each other become. So thinking about all of this from a production standpoint is really important.
Tushar Dadlani:
The last thing I want to talk about is, I think some of the newer areas of research have been around image captioning and video captioning. So as you can see here, there is a young boy brushing his teeth with a toothbrush, right? That is a very complex system to build, from an AI perspective, because now you're combining two domains of AI, which is computer vision and language. That is an area where I think the human mind is able to comprehend language very well, and the human mind is also able to comprehend visual information really well. Right? Computer vision is almost like a toddler who doesn't know how to speak, where, in their mind, they can see, okay, these are different colors, they are different things. This is a boy, this is a girl, things like that. They are learning context. But over time, as they start learning language and vision, they slowly start evolving into, okay, this is what the person is doing.
Tushar Dadlani:
A five-year-old can tell you that this is a boy brushing his teeth with a toothbrush, but I think computer vision is in the toddler stage where you can differentiate between different things, but you can't do such advanced thinking like "a young boy brushing his teeth with a toothbrush," at the speed at which a human brain can do that. That's on the algorithms side, right? But as you start thinking about this as, okay, these are the algorithms, I think an important thing that we usually forget is a quality of the human mind: being able to represent such complex information in a brain which is smaller than a data center. Right? So I think that is a really interesting thing, which AI has not been able to conquer yet. If you talk about AI, you're talking about hundreds and thousands of GPUs for training. You're talking about terabytes and petabytes of data, right? A thing that we in AI and computer vision have not been able to crack is: how do you reduce the data footprint for getting this quality of analysis? Right?
Tushar Dadlani:
If you think about the data volume, right? You take a 13 megapixel camera. Each frame is about 13 to 15 megabytes. And then you do some math and see, okay, at 30 frames per second, we are almost generating 400 megabytes of raw pixel data, right? Per second. And assuming you have a store with about 100 cameras, you start doing the math. You're like, "Okay, the raw volume is really growing very fast," even beyond the extent that hard drives can handle. I think a person who's used a DSLR and shot in raw can start relating to this, where if you hold down the shutter, you might be able to capture three, four, five, six frames. But if you go for a sequence of hundreds of frames at a very low exposure, you start seeing that, wow, I can't even capture these images continuously, because what is happening is you can't write that kind of data onto an SD card that fast. Right?
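To make the arithmetic concrete, here is a back-of-the-envelope calculation using the approximate figures from the talk (13 to 15 MB per raw frame, 30 frames per second, on the order of 100 cameras in a store). The exact numbers are rough approximations, not Standard's real specifications.

```python
# Back-of-the-envelope data volume math from the talk; frame size and camera
# count are approximations for illustration only.
mb_per_frame = 13.5           # ~13-15 MB of raw pixels from a 13 MP camera
frames_per_second = 30
cameras_per_store = 100

per_camera_mb_s = mb_per_frame * frames_per_second           # ~405 MB/s per camera
per_store_gb_s = per_camera_mb_s * cameras_per_store / 1000  # ~40 GB/s per store
per_store_tb_hour = per_store_gb_s * 3600 / 1000             # ~146 TB/hour

print(f"{per_camera_mb_s:.0f} MB/s per camera")
print(f"{per_store_gb_s:.1f} GB/s for a {cameras_per_store}-camera store")
print(f"{per_store_tb_hour:.0f} TB/hour of raw image data")
```

That last figure lines up with the roughly 144 terabytes per hour of raw image data mentioned next.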
Tushar Dadlani:
And then if you look at all of this in a brick-and-mortar store, you start thinking, okay, all of this data has been generated in a small physical environment. And the volume of that data is almost 144 terabytes of raw image data per hour. Right? This is like the MVP problem, and this is almost the MVP dataset. Right? When you start looking at hundreds and thousands of stores, you start multiplying this data volume and it just becomes unprecedented. So an important thing, when you start thinking about productionizing these systems, is: where are the fastest data streams being generated, right? Obviously, the camera is the first place where you're generating so much data. But when you start going further downstream, you have systems like inference systems, which basically take all these streams of images and generate insights, or metadata for the images, if you want to call it that. And then, based on the idea of generating more vectors, taking those vectors, and building backend logic around those vectors, you're saying, "Okay, as a result of all this complex logic, I have so much data coming so fast, but I can't ..."
Tushar Dadlani:
In the first block here, you can't transfer that amount of data over the network, right? Over the Internet. I think that is very obvious. Then you think about, what are the insights I need to gain to reduce this volume to be one frame per second, instead of 120 frames per second? At stage two, you are looking at, okay, how do I ... This is also a lot of data to generate a shopping cart, right? If you think of an Amazon.com, you have this catalog, you have a user. You just take five items, these things are your bucket, and this is your shopping cart. And then you check out, and then that's it. Right? But even a company like Amazon starts generating so much data with that. And then you look at this: for every shopping cart, this is the volume of data that you are generating. It's so much data that it is almost impossible to store, retrieve, and reproduce, right? So you can start seeing that hundreds of cameras generate almost 400 GB of data per second, right.
Tushar Dadlani:
The reason I'm stressing raw data so much is, if you talk to any deep learning researcher, I think the first thing they will tell you is, "I need data in the rawest form possible and at the highest fidelity possible." Right? So that I can label this data, so that I can do research, so that I can understand these cases. But the practicality of doing that is almost impossible. The other problem that I think is almost impossible to reproduce or mimic is human behavior, right? I think generating the exact same scenario of shopping is almost impossible. And having each and every system do that in the exact same way is a very hard problem, because at that point, you're adding a new domain of modeling noise and modeling physics. You might say, "Okay, synthetic data is great," but synthetic data only gets you so far. And it's almost impossible to reproduce every single edge case with synthetic data because the real world is a very complex beast.
Tushar Dadlani:
As you start building some of these computer vision systems in production, you start appreciating how complex the real world is, and how far the AI algorithms that we think will take over the world are from even determining what a boy is doing in a simple image, right? I think there are a number of factors that impact synthetic data, and I think what you're seeing here is just a few examples. But it can get way more complex. I think the other part, for folks who have a little more of a data background and are not very familiar with the computer vision space, an important thing that you start running into is that there is plenty of one-dimensional data, but a lot of the data is in 2D and 3D space. And 2D and 3D data always has some sort of coordinate system attached to it.
Tushar Dadlani:
And if you think about a robot with sensors, it generates data in different coordinate systems. And to make a decision, all the data needs to be in the same coordinate system, because if you have data in different coordinate systems, you need to do some coordinate transformation. And that introduces floating point errors. Each of these stages is something that is so hidden and complex that, as you start productionizing some of these systems, you're like, "Okay, each of the elements of this problem is now a very complex problem, and is an area of research in its respective field." So I was talking a little bit, previously, about how complex the human visual sensory perception stack is. I think one of the interesting things about deep learning is it can learn a lot of things from a single image. But then you start combining multiple deep learning systems and saying, "Okay, model A's probability impacts the probability of model B, which impacts the probability of model C."
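As a minimal sketch of the kind of coordinate transformation being described here: moving a point from one sensor's frame into a common frame is a rotation plus a translation, and every such hop happens in floating point. The rotation, translation, and point values below are made up purely for illustration.

```python
# Minimal sketch of moving a 3D point between coordinate systems.
# The transform values are arbitrary; real systems chain many such transforms,
# and each hop accumulates floating point error.
import numpy as np

def make_transform(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

# Hypothetical camera-to-store transform: 90-degree rotation about Z plus an offset.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0,              0,             1]])
camera_to_store = make_transform(R, np.array([2.0, 0.5, 3.0]))

point_in_camera = np.array([1.0, 2.0, 4.0, 1.0])   # homogeneous coordinates
point_in_store = camera_to_store @ point_in_camera

# Round-tripping through the inverse shows the small floating point drift
# the talk alludes to.
round_trip = np.linalg.inv(camera_to_store) @ point_in_store
print(point_in_store[:3], np.abs(round_trip[:3] - point_in_camera[:3]))
```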
Tushar Dadlani:
"And as a result, I have to make a self-driving car, like the Tesla, view that you can see in terms of view. You can see in this system, you have make a decision whether I should break or that I should accelerate, and how much I should turn. But I think what is very hard to teach an algorithm is common sense, right? As a human, I can say, "I don't need to drive fast and run into the building," but teaching a computer that kind of common sense is very hard, right? A personal example, when I recently moved to the Bay Area, I started driving here. And for me, coming from India, where traffic rules is not as well followed, I was able to still understand that, okay, this is a stop sign, that means I have to stop. Right? So just being able to contextually switch and start learning to drive in a new country was quite ... It took a couple of days. But after that, I was able to drive very easily.
Tushar Dadlani:
But if you switch this and you take a self-driving car that was trained for driving in India, and you move it to America, it will pretty much be a disaster. I think with that, I wanted to slowly start thinking about ... These are the algorithmic challenges that I think are very complex, and very hard to solve in themselves. But when you combine that with the volume of data, you are now looking at a very complex problem space where there's a lot of data and there are a lot of complex algorithms operating on this data, right? So I think if you start thinking about a computer vision data stack, you want to give the developers the ability to build and customize models based on production outcomes, to ensure accuracy. Right? I think the key phrase there is production outcomes, because a lot of times you think about data as being, okay, I have a training set, I have a validation set, I have a test set. With those three elements, my model is good enough.
Tushar Dadlani:
But you don't know how good your model is till you actually deploy it in production, because as I was saying previously, the real world throws very complex scenarios at you, and the algorithms are in for a shock at different layers of that stack. So I think the ability to build and customize models based on production outcomes, to improve accuracy, is a really key part of building out a data stack. I think the next thing I wanted to talk about is, what are the requirements of the data stack, right? So data exists for every engineer in the company, right? And this could be a machine learning engineer, this could be a deep learning researcher. This could be a front end engineer, this could be a backend engineer. This could be a data scientist, right? I think the thread that connects every individual ... And this could also go all the way to a product manager who wants to understand how the product is doing.
Tushar Dadlani:
The key thing that ties the business towards outcomes is data, right? And I think the ability to handle data well is something that some of the leading tech companies have invested a lot of resources in. I think that is what has caused them to differentiate in the market against other competitors in their space. Right? So I think if you look at this, right? Let's take the first example, right? Handle the volume and velocity of computer vision data. This is actually a systems and infrastructure problem that does not have a very easy solution, because if I go to an SRE and tell them, "Okay, I have to ingest 144 TB of data per hour," they are like, "Wait, where did you come up with that number?" Right? Then you think about joining data across multiple streams, right? So you're like, "Okay, I have all this data. I want to make some decisions based on it." At that point, you're starting to think about indexes. You're starting to think about, how do I retrieve this data fast enough so that I can do search, aggregations, and joins, right? And make the data more available.
Tushar Dadlani:
At this volume of data, you're almost thinking about hiring a whole data engineering team that just focuses on making your indexes more efficient. Right? So once you have some of this data, you're also thinking about, how do I make this data available to backend applications, and take this data and productionize it as a backend application as well? So then that goes into another domain, where you need a dedicated team to be able to build out ETL pipelines. For the people who don't know, ETL stands for extract, transform, load. So you basically want to take some data, create a pipeline, and generate data assets at every stage of the process. And then I think the data scientists and the product managers really like to dig deeper and analyze what exactly is happening with the data, and do some ad-hoc analysis, right? This ad-hoc analysis is not only limited to them; a deep learning researcher also wants to see how their model is performing in production, and they want access to their data quickly.
Tushar Dadlani:
I wanted to talk a little bit about the data stack for computer vision application development at Standard Cognition. A lot of our data is stored in Google Cloud Storage. And I think the challenge we ran into, with storing data in Google Cloud Storage and then dumping that data to BigQuery, is that by the time you get the insights on the data, that ship has already sailed, right? So you really want to take that data and process it, and get to a place where I can generate almost near real-time insights on the data, and not wait for a day or two to be able to actually ingest all this massive volume of data into something like BigQuery. And then once you take this data, you want to be able to expose it to some Node.js client, or any front-end application that might want to access this data and present it to the users. With that, I wanted to pass over the mic to Dhruba. He'll talk a little bit about Rockset.
Dhruba Borthakur:
Cool, yeah. Thank you, Tushar. That's a great explanation of the landscape and of how Rockset fits into a backend data processing system. I would like to take maybe a minute or two to talk about what Rockset is and how it powers your backend system. And then I'll hand the mic back to you so that you can show us the demo and actually walk us through it. So Rockset is a real-time indexing database, which is a cloud service. And the focus for us, again, is to be able to provide millisecond query latencies. Those queries could be search, aggregations, joins, or complex queries on very large datasets. If we could go to the next slide, please, Tushar. Rockset is, again, built with four pillars in mind, and these pillars are essentially there to help application developers move very fast.
Dhruba Borthakur:
Take, for example, the first pillar shown here, converged indexing, which means Rockset automatically builds inverted indexes on your data, on every field of the data. So it's very similar to how you would put data in Elasticsearch, for example. Rockset automatically builds an inverted index. Rockset also automatically builds a columnar index, so instead of you putting data in BigQuery or Redshift, or some other warehouse type of technology, Rockset already builds these indexes internally. And then it also builds a row index, for Postgres-type record fetches, so that you can also make those types of queries fast. So the focus for converged indexing is that the developer doesn't have to worry about being a DBA, doesn't have to configure data systems and figure out how to make queries fast. By default, it is set up to improve developer productivity and efficiency.
Dhruba Borthakur:
Similarly, I think in your use case, Tushar, for Standard Cognition, you get data from Google Cloud, but Rockset also has connectors to other systems, like if you have data in Kafka, for example, or if you have data in MongoDB, or some other data systems. Without having to set up ETL or scripts, you can just point and click, and data gets ingested and indexed in Rockset. Again, it helps developer productivity because none of your developers at Standard will be waiting for an SRE person to set up this dataset for you. You can be self-service and self-sufficient. Similarly, I guess there are some developers in your company who use JavaScript applications, maybe, or Python applications, or notebook applications. So instead of embedding very complex query logic in the application, Rockset has these query Lambdas, which are very much like cloud functions or AWS Lambdas, where you can put all the complexity of the query inside the Lambda, and then it exposes a REST endpoint. And then your applications find it very easy to just fetch this endpoint and get results.
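As a rough sketch of what calling a query Lambda from an application might look like, the example below hits the REST endpoint from Python. The API server, workspace, Lambda name, tag, and parameter are all placeholders, and the exact endpoint shape should be checked against Rockset's API documentation.

```python
# Hypothetical example of executing a Rockset query Lambda over REST.
# The workspace, Lambda name, tag, parameter, and API server are placeholders.
import os
import requests

API_SERVER = "https://api.rs2.usw2.rockset.com"   # region-specific; assumption
WORKSPACE = "commons"                               # hypothetical workspace
LAMBDA_NAME = "store_visit_stats"                   # hypothetical query Lambda

resp = requests.post(
    f"{API_SERVER}/v1/orgs/self/ws/{WORKSPACE}/lambdas/{LAMBDA_NAME}/tags/latest",
    headers={"Authorization": f"ApiKey {os.environ['ROCKSET_API_KEY']}"},
    json={"parameters": [{"name": "store_id", "type": "string", "value": "store-42"}]},
)
resp.raise_for_status()
for row in resp.json().get("results", []):
    print(row)
```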
Dhruba Borthakur:
You can also version these endpoints; you can version these queries. Again, developers like this feature because they think, I don't have to manage how the queries are being deployed and run. I can use the results of the query in my application directly. And at the end of the day, again, Rockset also has a SQL interface, which basically means that if you're a data scientist, for example, or you want to do ad-hoc queries, then you can just use full-featured SQL on this data. You can do complex joins and aggregates, and get the insights that you need when you are developing your model, or when you are just exploring large datasets. Those are the four pillars on which Rockset is built. I think Tushar will probably mention how each of these pillars helps his developers move fast. Tushar, I'm going to pass the mic back to you now. I also have a few questions from the audience. I can hold on to them maybe, and then repeat the questions when you're done with the demo.
Tushar Dadlani:
Yep. That sounds good. Just give me one ...
Dhruba Borthakur:
The demo is also, again ... I guess you are going to show some data from Rockset, right?
Tushar Dadlani:
Yep. Can you see my screen and my notebook?
Dhruba Borthakur:
Yes.
Tushar Dadlani:
Cool. I just wanted to run through a quick demo of ... So now essentially, we have all this data sitting somewhere in Rockset. Somebody asked me, "Hey, what about building a real-time theft prediction system in the store itself?" Right? The first thing that comes to my mind is, okay, I have all this data. As a person who can build models, I'm thinking, okay, what are the features that I even want to build, that I want to capture for my training data? And what are those features? Right? Some of the data comes from the Google Cloud ecosystem. The first thing I'm doing is just exploring some data that people have dumped into Rockset. I think here, you can see that there is, like I just described, the table. I try to understand how much data I'm talking about here. What is the volume of data? So in this dataset, just to give you a little background, this is a dataset of people in the store during a specific chunk of time.
Tushar Dadlani:
So the first thing I'm just trying to do is, how many people are in the store at a given point of time? I think that tells me, okay, there are 49 people in the store. And I know that somebody on the team, on the ground, told me, "Hey, a theft occurred in this specific case." So I'm now trying to dig deeper on that specific theft case. I'm looking at, okay, if there are a lot of people in the store, does that mean that theft is higher or not? It seems like one of the signals I can consider, but I don't have good statistics around that. Right? So I can just quickly connect to Rockset, start writing some SQL in a notebook, and then start looking at that. Right? So the duration, right? I basically was like, "Okay, how long was this person in the store for?" It says, okay, this person was there for four minutes. Four minutes is very quick. Since it's an autonomous checkout system, you can just go in and walk out.
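A minimal sketch of the kind of exploratory query being run here, fired from a notebook against Rockset's SQL endpoint. The collection name, field names, time range, and API server are hypothetical, and the endpoint shape should be verified against Rockset's API docs.

```python
# Hypothetical notebook cell: count distinct shoppers in the store during a
# time window via Rockset's SQL endpoint. Collection and field names are made up.
import os
import requests

sql = """
SELECT COUNT(DISTINCT person_id) AS people_in_store
FROM commons.store_tracks
WHERE observed_at_epoch BETWEEN 1591005600 AND 1591009200
"""

resp = requests.post(
    "https://api.rs2.usw2.rockset.com/v1/orgs/self/queries",  # region-specific
    headers={"Authorization": f"ApiKey {os.environ['ROCKSET_API_KEY']}"},
    json={"sql": {"query": sql}},
)
resp.raise_for_status()
print(resp.json()["results"])  # e.g. [{"people_in_store": 49}]
```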
Tushar Dadlani:
I think the behavior of a shopper is slightly different from what you might see in a traditional store. You probably spend four minutes just in the cashier line, right? And then now I'm looking at, okay, there was this true positive case, where did this person go around in the store? I think a thing that really helped me here was just writing SQL and getting the results in a data frame. I think the beauty of this is not the fact that I can run this query and get this data here. It is the fact that I can tweak this and generate much larger datasets with fast query performance, because otherwise, if I'm backed by BigQuery or I'm backed by HDFS, the total time to get the results itself is so high that I'm like, "Okay, this exploration is not even worth it." Right?
Tushar Dadlani:
And then I basically took the data, converted it to a data frame, and plotted where the person went in the store. So you can see that this is the trajectory of the person in the store, and what the location heat map of that person is. Right? I think an important thing to look at is, okay, this person was in this area 27% of the time. So these are quick statistics I'm trying to gather about what exactly might be good features to train on. And what are those features that I think might be useful to build this system? I guess, once I have all of this data, I think the next step for me is to generate a training dataset based on all the true positives and true negatives that have been labeled by somebody, or where somebody has told us, "Okay, these are all the cases of theft. These are all the cases of non-theft."
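A minimal sketch of that dataframe-and-plot step, assuming the query results have already been fetched into a list of rows; the position fields and the region used for the percentage are hypothetical stand-ins for what the demo shows.

```python
# Hypothetical continuation: turn query results into a dataframe and plot a
# location heat map of where one shopper spent time. Field names are made up.
import pandas as pd
import matplotlib.pyplot as plt

rows = [  # stand-in for rows returned by the SQL query above
    {"person_id": "person-7", "x_m": 3.1, "y_m": 8.4},
    {"person_id": "person-7", "x_m": 3.3, "y_m": 8.1},
    {"person_id": "person-7", "x_m": 6.0, "y_m": 2.2},
]
df = pd.DataFrame(rows)

# A 2D histogram over the store floor plan approximates the heat map shown.
plt.hist2d(df["x_m"], df["y_m"], bins=20)
plt.xlabel("store x (meters)")
plt.ylabel("store y (meters)")
plt.title("Shopper location heat map (illustrative)")
plt.colorbar(label="samples")
plt.show()

# Fraction of time spent in one region (the kind of ~27% statistic mentioned).
in_region = ((df["x_m"] > 3) & (df["x_m"] < 4) & (df["y_m"] > 8)).mean()
print(f"{in_region:.0%} of samples in that region")
```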
Tushar Dadlani:
With that, I can just build a prediction system, and then very rapidly take this, convert it to a query Lambda, and show this to the store manager like, hey, the likelihood of theft in this area is pretty high. You might want to consider moving these items here, or you might want to consider having more staff in that area, or things like that. With that, I just wanted to take it back to the presentation and open the floor for Q&A.
Dhruba Borthakur:
There are some questions about the demo that I see. I can repeat them for you, Tushar.
Tushar Dadlani:
Yep.
Dhruba Borthakur:
One of the questions is that, hey, you have data in the demo. You have data in Google Cloud. Did you have to set up an ETL process before you could ingest it into Rockset, and before you could make the SQL queries that you're making?
Tushar Dadlani:
We didn't have to do any of that. I was able to just click through the UI and say, "Okay, this is my bucket," and just start ingesting this data from this bucket for me. And then I just was able to query it on top of that.
Dhruba Borthakur:
Got it. Great. I have one more question. Actually, a few more questions. Let me see. Hey Tushar, I have a question regarding data volume. Did your team consider using an event-based camera? That way, you won't have to capture these whole unchanging blobs of image data. That's it. Does the question make sense?
Tushar Dadlani:
This is what Nest or some of those other systems do, right? I think one thing to think about is that event-based cameras have a couple of challenges with them. One is that they work on a lot of compressed streams, primarily. Secondly, they give information at a very sparse frame rate. So if you think about a Nest or any of those systems, you will get very sparse outputs from that system. And then I think the last part is, the types of events that you will get will be very limited too, because you're not just trying to understand what the events are, right? You have to capture the full interaction of a shopper with the shop.
Tushar Dadlani:
And capturing that complex interaction with data is almost impossible for today's state-of-the-art event cameras to do, because they don't have the processing power to do all the complex algorithms on the streams. And as I was talking about before, even if you generate these events at 30 frames per second, you have to build something that processes right near the camera. Those APIs are not evolved enough for us to start using something like that yet. Hopefully that answers the question.
Dhruba Borthakur:
I see. I have more questions. Another question is: what amount of data can Rockset handle? As it seems, for CV applications, you need the ability to store and search massive amounts of data. Is Rockset in memory, on disk, or a combination of the two? I can actually answer this question here, I think. Is that okay with you, Tushar?
Tushar Dadlani:
Yep, yeah. Go ahead.
Dhruba Borthakur:
The question is more about the Rockset side of things; it asks, what amounts of data can Rockset handle? Rockset is a cloud service and it scales up to the amount of data that you can put in. But again, I'm not talking about data which is 144 terabytes an hour. That is the raw stream data that Tushar mentioned. Most of the time, maybe it's a few gigabytes a second. It might be the metadata that you generate from your streams, which is where you might be able to put it into any data system as of now, for processing and querying. The sweet spot for Rockset is starting from a few gigabytes of data to, let's say, tens and hundreds of terabytes. It's not in memory. So Rockset essentially stores data. It separates out the query part and the storage part, which means the storage is completely on S3 or on cloud storage. So obviously the storage is infinite in size. It's only when you process or when you make queries that data gets loaded. You have different instances in Rockset.
Dhruba Borthakur:
There are some instances in Rockset where you can pull in all data into an SSD, so that you can do very quick joins and aggregations. But basically, it's up to the use case to be able to do this flexibly. I hope I answered your question. There's no physical limit for any of these because you're running on the cloud and memory systems, disk systems, and SSD systems. All are there, available. Rockset leverages all of them. And also, like compute resources. If you look at Rockset's Aggregator Leaf Tailer architecture, you can search for it in Google. It's called Aggregator Leaf Tailer architecture. And that architecture essentially tells us or shows how Rockset separates out the compute and the storage. There is ingest compute, there's storage, and then there's query compute. So we can scale each one of them up independently, based on where your bottleneck is.
Dhruba Borthakur:
I hope I answered that question. If you have follow-up questions, please do ask them here. I have a couple more questions. Another question, Tushar, mostly for you, I think, is: how do Standard Cognition developers get help from Rockset? Is it truly self-service, or is there a DBA in your company who is helping developers?
Tushar Dadlani:
I guess there's a mixed answer there. I think the first part is, is it self-service? I think for a person like me, it is fairly self-service. I just need to say, "Hey, this is what I need." I think the part where a DBA helps a little bit is ... We have a lot of other data sources, right? Like I said, the volume of data is so high. So sometimes you have to collect data or filter out some of the data, and put it in Rockset because it has derived data, to some extent. And you want to join it with some non-derived data. I think that's where the DBA ... I won't call the person a DBA; I'll call the person a data engineer who is responsible for all things data. And I think what is really valuable is that the Python API just makes it dead simple for me to hand the data engineer a notebook: hey, here's the notebook. They're like, "Okay, I'll just share this notebook with you." And my ETL gets done very trivially. I don't need to actually do any extra work, right?
Tushar Dadlani:
So essentially, what we've done is we've just wrapped a lot of our ETL systems in notebooks, which run periodically to dump some of this data to Rockset. This is mostly for the join part, right? The first part is, is it self-service for me? I would say, yes. But is it self-service for a cross-functional group of people who want to associate data across different building blocks or backend APIs? That's where I think it becomes a little complex, because now you're dealing with more and more different data systems. I think Rockset is self-service from that perspective, because I have tons of JSON data, and to run some queries on it, I'll just dump it into Rockset and start running queries. But when I'm trying to associate that data for the full production suite, I need to start talking to somebody else because ...
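As a hedged sketch of the kind of periodic notebook ETL described here, the cell below filters some derived JSON metadata and dumps it into a Rockset collection over the Documents REST API. The workspace, collection, document fields, and API server are placeholders, and the endpoint shape should be checked against Rockset's docs.

```python
# Hypothetical periodic ETL cell: push derived JSON metadata into a Rockset
# collection via the Documents API. All names here are placeholders.
import os
import requests

API_SERVER = "https://api.rs2.usw2.rockset.com"   # region-specific; assumption
WORKSPACE = "commons"
COLLECTION = "derived_store_events"

derived_docs = [  # stand-in for filtered, derived metadata produced upstream
    {"store_id": "store-42", "person_id": "person-7", "event": "pickup", "ts": 1591005661},
    {"store_id": "store-42", "person_id": "person-7", "event": "exit", "ts": 1591005902},
]

resp = requests.post(
    f"{API_SERVER}/v1/orgs/self/ws/{WORKSPACE}/collections/{COLLECTION}/docs",
    headers={"Authorization": f"ApiKey {os.environ['ROCKSET_API_KEY']}"},
    json={"data": derived_docs},
)
resp.raise_for_status()
print(resp.json())
```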
Dhruba Borthakur:
Sure. Okay, yeah. All that makes sense. I'm going to move on to the next question. One question that I have is that, to make the models run on large streaming data volumes, the software has to react quickly to real-life changes that are captured by the cameras. So do you need a really low data latency query system? Is Rockset sufficient for your low data latency needs?
Tushar Dadlani:
A lot of times, that data actually goes into very low-level systems. All that compute that you're describing is not ... You don't want a query-like system. You want guaranteed outcomes, very fast, rapid outcomes. So you are actually building that whole system from scratch. You're relying on some sort of data system, but that data system is not supposed to be in the cloud, just from a design perspective. You want it to be really fast. So you write it on your own, to be really fast, and don't rely on any existing cloud-based solution. Because as I said, the outcomes are really important in real time.
Tushar Dadlani:
I think one part, which I would like to talk a little bit about, to go a little deeper: you can do this thing really fast. But at the same time, you are trying to figure out what the hotspots of data are, essentially. The data hotspots, you want to capture and dump back into the cloud. But you can't write all the data as if all the data is a hotspot, because you run into throughput issues. I think that is how we look at that for the on-prem stack.
Dhruba Borthakur:
Got it. I see. Cool. I think there is another question. I think maybe it's related to the previous one. How do you handle in-store data versus data in the cloud? I don't exactly know what that means, but I think the question is probably asking, do you handle these two datasets differently, or are they different in nature? Or do you use data systems in the store itself, to do any kind of processing? Or are they very Rockset and very Standard Cognition specific?
Tushar Dadlani:
I think that goes into some of our IP space. I might want to not answer that question right now.
Dhruba Borthakur:
Sure.
Tushar Dadlani:
I hope that's not a problem, because it's too general.
Dhruba Borthakur:
Sure, yeah. Okay.
Tushar Dadlani:
If you can ask a more specific question, I think I'd be able to ...
Dhruba Borthakur:
Yeah, if somebody can maybe make that question more explicit, then I can ask it to you again. I have, actually, a question from myself. It looks like for computer vision data, the amount of data, the volume and the velocity, is very high. Right? From what I have read about vision in general, the demand or the amount of data that's being generated is growing 30 times or 35 times every two years. Whereas Moore's law usually says that, hey, computer systems, CPU, memory, and RAM, they might double every two years, or one and a half years. Something like that, right? The amount of volume is growing so fast in this vision and AI space. Do you think it is even sustainable to be able to do these kinds of analyses? Because the distributed systems that you use to store and process this data are not going to be twice as efficient every two years.
Tushar Dadlani:
I think I look at it almost as, if you look at evolution, human brains took a long time to evolve, right? So some of these data systems are like dinosaurs, where they are just so big and so unnecessarily large because you can't comprehend them efficiently. I think AI is that domain here, trying to build a brain. Right? But data is, okay, let's start with the amoeba. Let's rapidly evolve into a very big creature, right? And then those really big creatures can't survive because they need too much food, or they need too many resources. So I think it's almost like that parallel of evolution where, for survival, you need to become smarter. And that's true in the business landscape as well now. If you want to survive as a computer vision business, you need to get smarter about how you handle your data and how you intelligently process data. And not assume that, okay, this is just a raw data dump, I can just do some simple filtering.
Tushar Dadlani:
It's more complex than just a simple filter at that point. You're starting to build complex contextual understanding, where you're like, "Okay, this is what a person picked up." But all of that logic, which used to require massive datasets, now gets compressed into an algorithm. It works 98, 99% of the time. So all you're doing is building a system around the adaptability piece, right? So this is the 1% I need to adapt to, but the 99% can be done by an algorithm. Right? It's evolve and adapt, almost. That's how I look at it. Hope that answers your question.
Dhruba Borthakur:
Cool. No, I think that answers. Yeah, that definitely answers it. Cool. Thank you. For the rest of the people in the webinar, we'll hang out in the Slack channel, just to answer more questions. So please keep the questions coming. Questions could be directed to Tushar. I think he's also going to join the Slack channel and answer questions. Or you could just ask questions about Rockset. All of us are there in the community channel, and we'll do an AMA now, for the next 15 minutes or so, on this channel. That's good with you, Tushar?
Tushar Dadlani:
Yep. Sounds good.
Dhruba Borthakur:
Cool. Hey, thank you, guys. Thanks a lot for joining in. Really appreciate your time.