It is also known that the contents of an Excel file can be returned in different types from the Power Query point of view: Sheet, Table, etc., and mainly because of distrust towards the user, we use the mentioned Sheet type to avoid the user skips some row and the dynamic table extension is not applied, so we will not have all the data. But what we get is always “just” the resulting cell value. So you don’t get the calculation procedure, cell color, etc. You won’t get the result even if you query against said Excel directly using OLEDB.
OLEDB agains Excel
Though admittedly, this style of querying Excel also has some merit, as you can perform imaginary “SQL” queries against sheets. And that can be pretty fun (Tables$, Values$ are names of a sheets).
OLEDB with custom query agains Excel
But what to do when we need to get some of these parameters or worse? We need to get values based on that parameter.
Example: we have an Excel where the light yellow cells represent the values the user will fill in, and then we want to retrieve them for reporting purposes.
Ugh. I can hear the question now, so can we prepare some reasonable transformation to get these values out according to some rules? The answer? NO! Because we are looking for a universal way to extract these values. For those marked by writing macros in Excel using VBA, I’m sure more than one option to solve this has come to mind. For example, we could write a VBA script that would export these values based on the color attribute of the cell to, for example, a central Excel spreadsheet that we would then link. It would undoubtedly be the way to go, BUT Excel is not a database, and such a conversion mechanism would only cause more problems. Not to mention that in some companies, macros have been forbidden because of security. Furthermore, we will only work with uniform cell background colors, although similar thinking could be applied to other potentially searchable properties.
So back to Power Query! Is there any possibility? Let’s say yes! Most of the files we usually use, such as “.xlsx,” “.docx,” etc., are compressed folders with content primarily XML or JSON files. Why not compress that Excel and run the individual files through Power Query to find what we need?
Decompress the Excel file
Since we want to build a general procedure that can extract excels independently, we certainly don’t want to manually convert each file to “.zip” and then extract it. We need a procedure that we can use to do this directly from Power Query. Mark White has saved us a lot of work on this in his article Reading ZIP Files in Power Query, as he has already devised and compiled this procedure.
So if we apply it to the “binary” fingerprint of the file we are examining, we have half of the work already done because the “folder” with the content will immediately fall out.
So, as a result, we have to orient ourselves and understand what exactly we are looking for to write the appropriate transformation for it.
Structure of the searched problem
This is where the fun starts because Excel works with its cells in different ways, especially in terms of content. From my research, if I open the file of that sheet and try to find some values, I get different kinds of those values. It could be because of reducing the file size, and the text values are all stored together in another file, and from the point of view of the cells themselves, they are only referenced by the index. As a result, one text value does not have to be stored, for example, five times; it simply enters it once into the common XML and puts only its standard index in the cell. Numerical values, on the other hand, are stored directly in the cell. And the difference between whether a cell is text or numeric is known using the XML Attribute “t”, which represents text. If we have a date value, it will be stored as a number, and the calculations are instead stored as two values as the definition of the calculation and the resulting value. Both are within one record.
Another attribute that we should be interested in is “s,” which indicates the style of the character. This is where things get complicated because it is located in the styles file. But here, we can find that we have three different kinds of colors. Indexed colors, RGB, and Theme colors. In reality, however, Excel distinguishes them a little differently:
Depending on which type of color it is, Excel assigns the color to the given style in a slightly different way or that the given color is stored elsewhere. However, we only have their IDs for now. Still no color, but getting close. For the pieces of this puzzle to fit together, you need to find the location of the individual colors:
Index = ./style
RGB = ./style
Theme = ./theme/themeX
Now that we know where the colors are, we can move on to the next problem. The reason is that Excel registers colors as RGB attributes, which is not entirely accurate because, according to the stored value, it is instead a HEX notation of the color, extended by two “FF” characters from the front, in most cases. So we will always have to use Text.Contains().
RGB is not a RGB but HEX with FF as a prefix
The last piece of the puzzle needs to be mentioned! There is a point where Excel won’t save your color as HEX or even as RGB notation. This moment occurs when the user selects a color from the Gradient selection, which is automatically generated from Theme colors. At such a moment, the format will be saved within the styles file as a color ID. But this ID will lead to the original theme color, and the given color adjustment will be made using other attributes (tint, shade).
Usually, this wouldn’t be a problem as it would be enough to convert the color from HEX code to RGB using the following code:
Then perform tint or shade color adjustment and conversion back to HEX code.
But if you expect it to work in Excel anyway, you will be disappointed. According to all tests, Excel calculates this color change in some way because this standard procedure does not work 100% of the time I do the tests. At the same time, even if I tried this color modification using native VBA code and subsequently extracted the individual colors into RGB notations, the correct color codes did not always come out anyway. We may be able to reveal his approach to coding in the future, but unfortunately, until then, these Gradient colors are unobtainable to us.
Now we know everything we need. So all that remains is to deal with it somehow.
Building procedure and function
Colors are always linked to a style, and each cell has some style assigned. We can think about the problem: We get all the IDs of the colors we seek. Next, we filter all the styles so that we only have those that have a link to one of the colors we are looking for through indexes. From the obtained styles, we will keep only and only their IDs. We can subsequently filter all the cells with them and select only the ones we need. In the end, we’ll check what type of value it is, and if it’s numerical, we’ll save it straight away; for text output, we still have to reach into the text value repository according to the index to get this value.
It sounds wild but doable! I believe that the whole function could be optimized even further, but so far, this was enough for me; where in addition to the searched colors, I also added the possibility of removing some of the colors from the export:
I want to test it, though! So here I have an Excel demo with two worksheets, different values placed, and different colors.
Let’s try using the function mentioned above to get all the cells with the color “#ABD88C”:
YES!!! We have all cells with #ABD88C background color! Only dates look like numbers. (As always)
I would dare to summarize it with a classic message: “Don’t try this at home.” But that wouldn’t be true! It was a challenge that Tomáš Prokop came to me with, and I’m glad that I could crack this nut. Of course, there are better-looking and more straightforward options than solving it using Power Query, but if there is no other option, it’s certainly good to know that it can be done. In general: “Don’t say something can’t be done, or someone who doesn’t know will come and do it.” Like for example, this. I’ve often heard that it can’t be done, and the opposite is true. It won’t hurt if it’s about personal analysts or a one-time export. But using something like that and a production solution is playing with fire, and you can get very burned.
The function in this state returns ONLY the cells marked with a color. If you would like a different variant that would return, for example, a cell next to it, then the code needs to be slightly modified, but of course, it would also be possible!
The last thing I would like to add is: Be bold and try to break new ground. At most, you will find it impossible in that direction. It is still a valuable lesson; you won’t go down that road again. But if you could, and it turned out to be ineffective, you can still learn a lot. That’s why it’s good to doubt, experiment and try to discover new ways, and don’t be afraid to make mistakes when you’re learning. So then, when you need it, you will know a way to solve it.