Extraction of Colored Excel Cells

It is also known that the contents of an Excel file can be returned in different types from the Power Query point of view: Sheet, Table, etc., and mainly because of distrust towards the user, we use the mentioned Sheet type to avoid the user skips some row and the dynamic table extension is not applied, so we will not have all the data. But what we get is always “just” the resulting cell value. So you don’t get the calculation procedure, cell color, etc. You won’t get the result even if you query against said Excel directly using OLEDB.

OLEDB agains Excel OLEDB agains Excel

Though admittedly, this style of querying Excel also has some merit, as you can perform imaginary “SQL” queries against sheets. And that can be pretty fun (Tables$, Values$ are names of a sheets).

OLEDB with custom query agains Excel OLEDB with custom query agains Excel

But what to do when we need to get some of these parameters or worse? We need to get values based on that parameter.

Example: we have an Excel where the light yellow cells represent the values the user will fill in, and then we want to retrieve them for reporting purposes.

Ugh. I can hear the question now, so can we prepare some reasonable transformation to get these values out according to some rules? The answer? NO! Because we are looking for a universal way to extract these values. For those marked by writing macros in Excel using VBA, I’m sure more than one option to solve this has come to mind. For example, we could write a VBA script that would export these values based on the color attribute of the cell to, for example, a central Excel spreadsheet that we would then link. It would undoubtedly be the way to go, BUT Excel is not a database, and such a conversion mechanism would only cause more problems. Not to mention that in some companies, macros have been forbidden because of security. Furthermore, we will only work with uniform cell background colors, although similar thinking could be applied to other potentially searchable properties.

So back to Power Query! Is there any possibility? Let’s say yes! Most of the files we usually use, such as “.xlsx,” “.docx,” etc., are compressed folders with content primarily XML or JSON files. Why not compress that Excel and run the individual files through Power Query to find what we need?

Decompress the Excel file

Since we want to build a general procedure that can extract excels independently, we certainly don’t want to manually convert each file to “.zip” and then extract it. We need a procedure that we can use to do this directly from Power Query. Mark White has saved us a lot of work on this in his article Reading ZIP Files in Power Query, as he has already devised and compiled this procedure.

/* UNZIP FUNCTION FROM MARK WHITE LINK TO ARTICLE WITH CODE http://sql10.blogspot.com/2016/06/reading-zip-files-in-powerquery-m.html */
(xlsxFile as binary) =>
    let
        Header = BinaryFormat.Record(
            [
                MiscHeader = BinaryFormat.Binary(14),
                BinarySize = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
                FileSize = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
                FileNameLen = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
                ExtrasLen = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian)
            ]
        ),
        HeaderChoice = BinaryFormat.Choice(
            BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
            each
                if _ <> 67324752
                // not the IsValid number? then return a dummy formatter
                then
                    BinaryFormat.Record([IsValid = false, Filename = null, Content = null])
                else
                    BinaryFormat.Choice(
                        BinaryFormat.Binary(26),
                        // Header payload – 14+4+4+2+2
                        each
                            BinaryFormat.Record(
                                [
                                    IsValid = true,
                                    Filename = BinaryFormat.Text(Header(_)[FileNameLen]),
                                    Extras = BinaryFormat.Text(Header(_)[ExtrasLen]),
                                    Content = BinaryFormat.Transform(
                                        BinaryFormat.Binary(Header(_)[BinarySize]),
                                        (x) =>
                                            try
                                                Binary.Buffer(Binary.Decompress(x, Compression.Deflate))
                                            otherwise
                                                null
                                    )
                                ]
                            ),
                        type binary
                        // enable streaming
                    )
        ),
        ZipFormat = BinaryFormat.List(HeaderChoice, each _[IsValid] = true),
        Entries = List.Transform(
            List.RemoveLastN(ZipFormat(xlsxFile), 1), (e) => [FileName = e[Filename], Content = e[Content]]
        )
    in
        Table.FromRecords(Entries)
// END OF MARK WHITE's UNZIP FUNCTION //

So if we apply it to the “binary” fingerprint of the file we are examining, we have half of the work already done because the “folder” with the content will immediately fall out.

Extracted Excel Extracted Excel

So, as a result, we have to orient ourselves and understand what exactly we are looking for to write the appropriate transformation for it.

Structure of the searched problem

This is where the fun starts because Excel works with its cells in different ways, especially in terms of content. From my research, if I open the file of that sheet and try to find some values, I get different kinds of those values. It could be because of reducing the file size, and the text values are all stored together in another file, and from the point of view of the cells themselves, they are only referenced by the index. As a result, one text value does not have to be stored, for example, five times; it simply enters it once into the common XML and puts only its standard index in the cell. Numerical values, on the other hand, are stored directly in the cell. And the difference between whether a cell is text or numeric is known using the XML Attribute “t”, which represents text. If we have a date value, it will be stored as a number, and the calculations are instead stored as two values as the definition of the calculation and the resulting value. Both are within one record.

t Attribute t Attribute

Another attribute that we should be interested in is “s,” which indicates the style of the character. This is where things get complicated because it is located in the styles file. But here, we can find that we have three different kinds of colors. Indexed colors, RGB, and Theme colors. In reality, however, Excel distinguishes them a little differently:

Excel colors Excel colors

Depending on which type of color it is, Excel assigns the color to the given style in a slightly different way or that the given color is stored elsewhere. However, we only have their IDs for now. Still no color, but getting close. For the pieces of this puzzle to fit together, you need to find the location of the individual colors:

  • Index = ./style
  • RGB = ./style
  • Theme = ./theme/themeX

Now that we know where the colors are, we can move on to the next problem. The reason is that Excel registers colors as RGB attributes, which is not entirely accurate because, according to the stored value, it is instead a HEX notation of the color, extended by two “FF” characters from the front, in most cases. So we will always have to use Text.Contains().

RGB is not a RGB but HEX with FF as a prefix RGB is not a RGB but HEX with FF as a prefix

The last piece of the puzzle needs to be mentioned! There is a point where Excel won’t save your color as HEX or even as RGB notation. This moment occurs when the user selects a color from the Gradient selection, which is automatically generated from Theme colors. At such a moment, the format will be saved within the styles file as a color ID. But this ID will lead to the original theme color, and the given color adjustment will be made using other attributes (tint, shade).

Mind exploded

Usually, this wouldn’t be a problem as it would be enough to convert the color from HEX code to RGB using the following code:

let
    output = (color as text, optional direction as nullable number) =>
        let
            dr = if direction = null then 0 else direction,
            dirList = {0, 1},
            act =
                try
                    let
                        clearedColor = Text.Replace(Text.Upper(Text.Remove(color, {"#", "(", ")"})), "RGB", ""),
                        hexList = {"A", "B", "C", "D", "E", "F"},
                        dir = dirList{List.PositionOf(dirList, dr)},
                        typeOfCalculation =
                            if dr = 0 then
                                let
                                    colorModifier = (hexColorValue as number) =>
                                        let
                                            firstPosition = Number.Mod(hexColorValue, 16),
                                            decreaser = Number.IntegerDivide(hexColorValue, 16),
                                            zeroPosition = Number.Mod(decreaser, 16),
                                            replacer = List.Transform(
                                                {zeroPosition, firstPosition},
                                                each if _ > 9 then hexList{(Number.Mod(_, 9) - 1)} else Text.From(_)
                                            )
                                        in
                                            Text.Combine(replacer, ""),
                                    listSeparator = Text.Split(clearedColor, ","),
                                    transformator = List.Transform(listSeparator, each colorModifier(Number.From(_)))
                                in
                                    "#" & Text.Combine(transformator, "")
                            else
                                let
                                    colorModifier = (decimalColorValue as text) =>
                                        let
                                            split = Splitter.SplitTextByPositions({0, 1})(decimalColorValue),
                                            modifyContent = List.Transform(
                                                split,
                                                each
                                                    let
                                                        vl = _, p = List.PositionOf(hexList, _)
                                                    in
                                                        if p < 0 then
                                                            Number.From(vl)
                                                        else
                                                            p + 10
                                            ),
                                            out = modifyContent{0} * 16 + modifyContent{1}
                                        in
                                            Text.From(out),
                                    listSeparator = Splitter.SplitTextByPositions({0, 2, 4})(clearedColor),
                                    transformator = List.Transform(listSeparator, each colorModifier(_))
                                in
                                    "RGB(" & Text.Combine(transformator, ",") & ")"
                    in
                        typeOfCalculation
                    catch (r) =>
                        if r[Message] = "We couldn't convert to Number." then
                            error
                                [
                                    Message = "Please make sure, that you are inserting correct color type based on your selection in second parameter of this fucntion!"
                                ]
                        else
                            error [Message = "Please insert only these values {0,1} -> 0 = RGB2HEX or 1 = HEX2RGB"]
        in
            act,
    documentation = [
        Documentation.Name = " get-ConverterBetweenHEXandRGB.pq ",
        Documentation.Description = " Converts RGB color code to HEX code and VICE VERSA. Native direction is from RGB to HEX, opposite direction is settable by second parameter of this function. This function can work with standart declaration of these color codes like: ""RGB(.,.,.)"" or ""#FFFFFF"". Accepted enumerations: 0 = RGB2HEX, 1 = HEX2RGB ",
        Documentation.Source = "https://www.datameerkat.com ",
        Documentation.Version = " 1.0 ",
        Documentation.Author = " Štěpán Rešl ",
        Documentation.Examples = {
            [
                Description = " Convert from HEX to RGB *doesnt require second parameter* ",
                Code = " #""get-ConverterBetweenHEXandRGB""(""RGB(123,10,50)"") ",
                Result = "#7B0A32"
            ],
            [
                Description = " Convert from HEX to RGB ",
                Code = " #""get-ConverterBetweenHEXandRGB""(""#7B0A32"",1) ",
                Result = "RGB(123,10,50)"
            ]
        }
    ]
in
    Value.ReplaceType(output, Value.ReplaceMetadata(Value.Type(output), documentation))

Then perform tint or shade color adjustment and conversion back to HEX code.

Color modification Color modification

But if you expect it to work in Excel anyway, you will be disappointed. According to all tests, Excel calculates this color change in some way because this standard procedure does not work 100% of the time I do the tests. At the same time, even if I tried this color modification using native VBA code and subsequently extracted the individual colors into RGB notations, the correct color codes did not always come out anyway. We may be able to reveal his approach to coding in the future, but unfortunately, until then, these Gradient colors are unobtainable to us.

Now we know everything we need. So all that remains is to deal with it somehow.

Building procedure and function

Colors are always linked to a style, and each cell has some style assigned. We can think about the problem: We get all the IDs of the colors we seek. Next, we filter all the styles so that we only have those that have a link to one of the colors we are looking for through indexes. From the obtained styles, we will keep only and only their IDs. We can subsequently filter all the cells with them and select only the ones we need. In the end, we’ll check what type of value it is, and if it’s numerical, we’ll save it straight away; for text output, we still have to reach into the text value repository according to the index to get this value.

It sounds wild but doable! I believe that the whole function could be optimized even further, but so far, this was enough for me; where in addition to the searched colors, I also added the possibility of removing some of the colors from the export:

let
    output = (xslxBinary as binary, optional searchedColor as text, optional notSearchedColor as text) =>
        let
            /* UNZIP FUNCTION FROM MARK WHITE
     LINK TO ARTICLE WITH CODE http://sql10.blogspot.com/2016/06/reading-zip-files-in-powerquery-m.html */
            unZipOfXLSX = (xlsxFile as binary) =>
                let
                    Header = BinaryFormat.Record(
                        [
                            MiscHeader = BinaryFormat.Binary(14),
                            BinarySize = BinaryFormat.ByteOrder(
                                BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian
                            ),
                            FileSize = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
                            FileNameLen = BinaryFormat.ByteOrder(
                                BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian
                            ),
                            ExtrasLen = BinaryFormat.ByteOrder(
                                BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian
                            )
                        ]
                    ),
                    HeaderChoice = BinaryFormat.Choice(
                        BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
                        each
                            if _ <> 67324752
                            // not the IsValid number? then return a dummy formatter
                            then
                                BinaryFormat.Record([IsValid = false, Filename = null, Content = null])
                            else
                                BinaryFormat.Choice(
                                    BinaryFormat.Binary(26),
                                    // Header payload – 14+4+4+2+2
                                    each
                                        BinaryFormat.Record(
                                            [
                                                IsValid = true,
                                                Filename = BinaryFormat.Text(Header(_)[FileNameLen]),
                                                Extras = BinaryFormat.Text(Header(_)[ExtrasLen]),
                                                Content = BinaryFormat.Transform(
                                                    BinaryFormat.Binary(Header(_)[BinarySize]),
                                                    (x) =>
                                                        try
                                                            Binary.Buffer(Binary.Decompress(x, Compression.Deflate))
                                                        otherwise
                                                            null
                                                )
                                            ]
                                        ),
                                    type binary
                                    // enable streaming
                                )
                    ),
                    ZipFormat = BinaryFormat.List(HeaderChoice, each _[IsValid] = true),
                    Entries = List.Transform(
                        List.RemoveLastN(ZipFormat(xlsxFile), 1),
                        (e) => [FileName = e[Filename], Content = e[Content]]
                    )
                in
                    Table.FromRecords(Entries),
            // END OF MARK WHITE's UNZIP FUNCTION //
            // Suport Functions //
            themeColors = (themeBinary as binary) =>
                let
                    schemaColorList =
                        let
                            initTbl = Table.Combine(
                                Record.ToList(
                                    Table.RemoveColumns(
                                        Xml.Tables(themeBinary)[themeElements]{0}[clrScheme]{0}, "Attribute:name"
                                    ){
                                        0
                                    }
                                )
                            ),
                            extraction = Table.AddColumn(
                                initTbl,
                                "clrs",
                                each
                                    (
                                        if [sysClr] <> null then
                                            [sysClr]{0}[#"Attribute:lastClr"]
                                        else
                                            [srgbClr]{0}[#"Attribute:val"]
                                    )
                                        & "00"
                            )[clrs]
                        in
                            extraction
                in
                    schemaColorList,
            stylesIDs = (stylesXML as binary, themeColors as list) =>
                let
                    Source = Xml.Tables(stylesXML){[Name = "fills"]}[Table]{0}[fill],
                    expandPatternOfFill = Table.ExpandTableColumn(Source, "patternFill", {"fgColor"}, {"fgColor"}),
                    expandBackGroundColor = Table.ExpandTableColumn(
                        expandPatternOfFill,
                        "fgColor",
                        {"Attribute:indexed", "Attribute:rgb", "Attribute:theme"},
                        {"indexed", "rgb", "theme"}
                    ),
                    indexedColorsList = List.Buffer(
                        Table.ToList(
                            Xml.Tables(
                                stylesXML{[Name = "styles.xml"]}[Content]
                            ){
                                [Name = "colors"]
                            }[Table]{
                                [Name = "indexedColors"]
                            }[Table]{
                                0
                            }[Table]
                        )
                    ),
                    indexCreator = Table.AddIndexColumn(expandBackGroundColor, "Index", 0, 1, Int64.Type),
                    filterForSpecificColor =
                        if searchedColor <> null then
                            let
                                colorSeeker = Table.AddColumn(
                                    indexCreator,
                                    "colorSeeker",
                                    each
                                        try
                                            if [indexed] <> null then
                                                indexedColorsList{Number.From([indexed])}
                                            else if [rgb] <> null then
                                                [rgb]
                                            else
                                                themeColors{Number.From([theme])}
                                            otherwise null
                                )
                            in
                                Table.SelectRows(colorSeeker, each Text.Contains([colorSeeker], searchedColor))
                        else if notSearchedColor <> null then
                            let
                                colorSeeker = Table.AddColumn(
                                    indexCreator,
                                    "colorSeeker",
                                    each
                                        try
                                            if [indexed] <> null then
                                                indexedColorsList{Number.From([indexed])}
                                            else if [rgb] <> null then
                                                [rgb]
                                            else
                                                themeColors{Number.From([theme])}
                                            otherwise null
                                )
                            in
                                Table.SelectRows(
                                    colorSeeker,
                                    each
                                        not Text.Contains([colorSeeker], notSearchedColor) and [colorSeeker] <> null
                                )
                        else
                            let
                                nonNullCounter = Table.AddColumn(
                                    indexCreator,
                                    "counter",
                                    each List.NonNullCount(Record.ToList(Record.RemoveFields(_, "Index")))
                                )
                            in
                                Table.SelectRows(nonNullCounter, each [counter] <> 0)
                in
                    filterForSpecificColor[Index],
            XfsIDs = (stylesXML as binary, listOfStyles as list) =>
                let
                    Source = Xml.Tables(stylesXML){[Name = "cellXfs"]}[Table][xf]{0}[[#"Attribute:fillId"]],
                    indexCreator = List.Transform(
                        Table.SelectRows(
                            Table.AddIndexColumn(Source, "Index", 0, 1, Int64.Type),
                            each List.Contains(listOfStyles, Number.From([#"Attribute:fillId"]))
                        )[Index],
                        each Text.From(_)
                    )
                in
                    indexCreator,
            valueExtract = (workheetBinaries as binary, XfsSearchingIds as list, dt as list) =>
                let
                    sheetExtraction = Xml.Tables(workheetBinaries){[Name = "sheetData"]}[Table][Table]{0}[[c]],
                    expandSheetData = Table.SelectRows(
                        Table.AddColumn(
                            Table.ExpandTableColumn(
                                sheetExtraction,
                                "c",
                                {"v", "Attribute:r", "Attribute:s", "Attribute:t"},
                                {"v", "r", "s", "t"}
                            ),
                            "search",
                            each let s = _[s] in List.Select(XfsSearchingIds, each _ = s){0} ?
                        ),
                        each [search] <> null
                    )[[v], [t], [search], [s], [r]],
                    output = Table.AddColumn(
                        expandSheetData,
                        "val",
                        each
                            if [t] = null then
                                [v]
                            else
                                try dt{Number.From([v])}[#"Element:Text"]{0} otherwise dt{Number.From([v])}
                    )
                in
                    output[[val], [r]],
            sharedStrings = (shString as binary) => let Source = Xml.Tables(shString)[si]{0}[t] in Source,
            //Data Extraction
            unzipingXLSX = unZipOfXLSX(xslxBinary),
            themes = List.Buffer(themeColors(unzipingXLSX{[FileName = "xl/theme/theme1.xml"]}[Content])),
            styles = List.Buffer(stylesIDs(unzipingXLSX{[FileName = "xl/styles.xml"]}[Content], themes)),
            xfs = List.Buffer(XfsIDs(unzipingXLSX{[FileName = "xl/styles.xml"]}[Content], styles)),
            data = List.Buffer(sharedStrings(unzipingXLSX{[FileName = "xl/sharedStrings.xml"]}[Content])),
            extraction =
                let
                    generator = Table.RenameColumns(
                        Table.SelectRows(
                            Table.TransformColumns(
                                Table.SelectRows(
                                    unzipingXLSX,
                                    each
                                        Text.Contains([FileName], "worksheets")
                                        and Text.End([FileName], 5) <> ".rels"
                                ),
                                {
                                    {"FileName", each Text.BetweenDelimiters(_, "xl/worksheets/", ".xml")},
                                    {"Content", each valueExtract(_, xfs, data)}
                                }
                            ),
                            each not Table.IsEmpty(_[Content])
                        ),
                        {"FileName", "Sheet"}
                    ),
                    expander = Table.TransformColumnTypes(
                        Table.ExpandTableColumn(generator, "Content", {"val", "r"}, {"Value", "Position"}),
                        { {"Sheet", type text}, {"Value", type text}, {"Position", type text} }
                    )
                in
                    expander
        in
            extraction,
    documentation = [
        Documentation.Name = " get-ColoredExcelCells.pq ",
        Documentation.Description = " Function for extracting colored cells from Excel file. ",
        Documentation.Source = "https://www.datameerkat.com ",
        Documentation.Version = " 1.0 ",
        Documentation.Author = " Štěpán Rešl "
    ]
in
    Value.ReplaceType(
        output,
        Value.ReplaceMetadata(
            Value.Type(output),
            documentation
        )
    )

I want to test it, though! So here I have an Excel demo with two worksheets, different values placed, and different colors.

Demo Excel Demo Excel

Let’s try using the function mentioned above to get all the cells with the color “#ABD88C”:

Extracted values Extracted values

YES!!! We have all cells with #ABD88C background color! Only dates look like numbers. (As always)

Summary

I would dare to summarize it with a classic message: “Don’t try this at home.” But that wouldn’t be true! It was a challenge that Tomáš Prokop came to me with, and I’m glad that I could crack this nut. Of course, there are better-looking and more straightforward options than solving it using Power Query, but if there is no other option, it’s certainly good to know that it can be done. In general: “Don’t say something can’t be done, or someone who doesn’t know will come and do it.” Like for example, this. I’ve often heard that it can’t be done, and the opposite is true. It won’t hurt if it’s about personal analysts or a one-time export. But using something like that and a production solution is playing with fire, and you can get very burned.

The function in this state returns ONLY the cells marked with a color. If you would like a different variant that would return, for example, a cell next to it, then the code needs to be slightly modified, but of course, it would also be possible!

The last thing I would like to add is: Be bold and try to break new ground. At most, you will find it impossible in that direction. It is still a valuable lesson; you won’t go down that road again. But if you could, and it turned out to be ineffective, you can still learn a lot. That’s why it’s good to doubt, experiment and try to discover new ways, and don’t be afraid to make mistakes when you’re learning. So then, when you need it, you will know a way to solve it.

Extraction of Colored Excel Cells
Older post

Domains

Everyone knows that Power Query can get data from individual cells in Excel. But what about only colored cells?

Newer post

Lessons learnt from PySpark Notebooks and extracting APIs

Everyone knows that Power Query can get data from individual cells in Excel. But what about only colored cells?

Extraction of Colored Excel Cells