How do you index a massive, randomly selected, uncontrollable, constantly changing dataset?
Imagine you want to index all of the snow particles in a giant snowglobe that is constantly being shaken. Each snow particle represents a piece of lengthy text (say, a book) that needs to be indexed; however, the particles are constantly moving and cannot be grouped into categories of indexed and not indexed. On top of this, snow particles disappear from the globe (either through disintegration or removal) and new particles are added. It is also possible for the text of a book to be changed or updated. I want to keep the index as up to date as possible, so older index data may need to be refreshed periodically.
a. Is there an algorithm that avoids having to look up each particle to check whether it has already been indexed? On a large enough dataset, that per-item lookup becomes very burdensome. I imagine such an algorithm would carry extra overhead and might be cumbersome for smaller datasets, but it should prove faster on larger ones as the index grows.
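One structure that seems to fit (a) is a Bloom filter: a fixed-size bit array that answers "definitely not indexed" or "probably indexed" without storing or looking up the items themselves. False positives are possible (a new particle might occasionally be skipped), but false negatives are not, so nothing already indexed is ever re-processed because the filter forgot it. A minimal sketch, with sizes and hash counts picked arbitrarily for illustration:

```python
import hashlib

class BloomFilter:
    """Probabilistic membership test: 'definitely new' or 'probably seen'.

    Memory cost is fixed (size_bits), independent of item size, so checking
    a book costs a few hashes instead of a full index lookup.
    """

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        # Derive num_hashes independent bit positions from one key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def probably_contains(self, key: str) -> bool:
        # All bits set -> probably seen; any bit clear -> definitely new.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

The trade-off matches the intuition above: for a small dataset the filter is pure overhead, but its cost stays constant while a naive lookup grows with the index.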
b. How should the index be updated to verify that the text still exists and is still the same? I would rather not compare the full data whenever I randomly pick up an already indexed item, to avoid unnecessary loops (i.e. repeatedly picking indexed items instead of spending that time indexing new data). As long as the first problem is solved, so that each examined item is a new item, I suppose a periodic update scheme that sweeps through the index checking for consistency would work. Any other thoughts?
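For (b), one common trick is to store a content fingerprint (a hash) per entry at index time, then run a periodic sweep that re-hashes only entries older than some freshness window: comparing hashes is far cheaper than comparing full book texts, and recently verified entries are skipped entirely. A rough sketch under assumed names (`fetch_text`, the `index` layout, and `max_age_seconds` are all mine, not from the question):

```python
import hashlib
import time

def fingerprint(text: str) -> str:
    """Cheap content fingerprint; compare hashes instead of full texts."""
    return hashlib.sha256(text.encode()).hexdigest()

def refresh_index(index: dict, fetch_text, max_age_seconds: float):
    """Periodic consistency sweep over {item_id: {"hash": ..., "checked": ...}}.

    fetch_text(item_id) returns the item's current text, or None if the
    particle has disappeared from the globe. Only entries older than
    max_age_seconds are touched, so a frequent sweep stays cheap.
    """
    now = time.time()
    for item_id in list(index):
        entry = index[item_id]
        if now - entry["checked"] < max_age_seconds:
            continue  # recently verified, skip
        text = fetch_text(item_id)
        if text is None:
            del index[item_id]                   # particle disappeared
        elif fingerprint(text) != entry["hash"]:
            entry["hash"] = fingerprint(text)    # text changed: re-index here
            entry["checked"] = now
        else:
            entry["checked"] = now               # unchanged, just bump timestamp
```

This keeps the "check for staleness" work proportional to the number of stale entries rather than the size of the index, which seems to be the property the periodic update scheme needs.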