wiglenet / wigle-wifi-wardriving

Nethugging client for Android, from wigle.net
https://wigle.net
BSD 3-Clause "New" or "Revised" License

Record Bluetooth "Service Data" advertised data #616

Open XenoKovah opened 10 months ago

XenoKovah commented 10 months ago

This is a request to start collecting additional information from Bluetooth Low Energy Advertisements (but not BT Classic AFAIK).

One of the types of information that can be advertised by a device is "Service Data" (type 0x16, https://developer.android.com/reference/android/bluetooth/le/ScanRecord#DATA_TYPE_SERVICE_DATA_16_BIT) which is a UUID16 followed by arbitrary-length service-specific data. (It's not really clear to me how this functionally differs from manufacturer-specific data of type 0xFF, other than the fact that the UUID16 could be a service ID rather than only a company ID, as in the case of manufacturer-specific data.)
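For readers unfamiliar with the layout, here is a minimal sketch of how a type 0x16 field could be pulled out of a raw advertising payload. It assumes the standard length/type/data framing of advertising data and a little-endian UUID16. The function names and the sample payload are hypothetical and are not taken from the WiGLE client.

```python
from typing import Iterator, List, Tuple

SERVICE_DATA_16_BIT = 0x16  # AD type for "Service Data - 16-bit UUID"


def iter_ad_structures(payload: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (ad_type, ad_data) for each length/type/data structure."""
    i = 0
    while i < len(payload):
        length = payload[i]  # counts the type byte plus the data bytes
        if length == 0:  # zero length: treat the rest as padding
            break
        end = i + 1 + length
        if end > len(payload):  # truncated structure, stop parsing
            break
        yield payload[i + 1], payload[i + 2:end]
        i = end


def service_data_16(payload: bytes) -> List[Tuple[int, bytes]]:
    """Return (uuid16, service_specific_data) for every 0x16 structure."""
    found = []
    for ad_type, data in iter_ad_structures(payload):
        if ad_type == SERVICE_DATA_16_BIT and len(data) >= 2:
            uuid16 = int.from_bytes(data[:2], "little")
            found.append((uuid16, data[2:]))
    return found


# Hypothetical payload: Flags (0x01), a Complete List of 16-bit UUIDs (0x03)
# holding 0x180F, and Service Data (0x16) for UUID 0xFEED with two data bytes.
adv = bytes.fromhex("020106" "03030f18" "0516edfe0102")
result = service_data_16(adv)
```

In this sample only the 0x16 structure is reported, so the UUID listed in the type 0x03 structure is not mixed in with the service data UUID.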

Company IDs can be looked up here: https://bitbucket.org/bluetooth-SIG/public/src/main/assigned_numbers/assigned_numbers/uuids/member_uuids.yaml

Service IDs can be looked up here: https://bitbucket.org/bluetooth-SIG/public/src/main/assigned_numbers/assigned_numbers/uuids/service_uuids.yaml
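A rough sketch of how a UUID16 could be resolved against those two lists follows. The two dictionaries are tiny hand-copied excerpts, assumed here for illustration and worth verifying against the YAML files. A real tool would load the full files instead. The function name is hypothetical.

```python
from typing import Dict, Tuple

# Assumed excerpts of the SIG lists; check names and values against the YAML.
MEMBER_UUIDS: Dict[int, str] = {0xFEED: "Tile, Inc."}
SERVICE_UUIDS: Dict[int, str] = {0x180F: "Battery", 0x180D: "Heart Rate"}


def classify_uuid16(uuid16: int) -> Tuple[str, str]:
    """Return (kind, name), where kind is 'member', 'service' or 'unknown'."""
    if uuid16 in MEMBER_UUIDS:
        return ("member", MEMBER_UUIDS[uuid16])
    if uuid16 in SERVICE_UUIDS:
        return ("service", SERVICE_UUIDS[uuid16])
    # Not in either excerpt: report the raw value for manual lookup.
    return ("unknown", f"0x{uuid16:04X}")
```

Checking the member list first is an arbitrary choice in this sketch; the two lists are not expected to overlap.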

Within my data, this data type is one of the rarer types, and therefore when it comes to prioritizing these requests, it can probably be done last.

Example with Company ID:

[Image: Pasted Graphic 10]

Example with Service ID:

[Image: Pasted Graphic 11]

rksh commented 10 months ago

Compared to the measured rarity of the primary service UUIDs, and then the relative rarity of mfgr data, how common are these? This feels pretty far along the curve of diminishing returns.

XenoKovah commented 10 months ago

When trying to answer that question, I noticed that I actually didn't have the Tile example above in my database. When I went back, I found out that this was because it was a corner case which I couldn't handle properly with Wireshark-based parsing: it overlapped too frequently with the type 2 and 3 cases, and the parsing then erroneously included their UUIDs instead of only the service data UUID. So apparently I hadn't extracted this data from anything but a small test before I went asking on the Wireshark forums and forgot about it. I'm going to have to see if I can find something else to parse the btsnoop HCI logs more effectively, then extract the data from all my logs and get back to you on that.

But just keep in mind you have about 3 orders of magnitude more data than me, from far more places in the world. And since there can be entirely geographic-specific devices[1], there can be pockets of data behavior that I've never seen. So to me it feels worth it to collect basically all advertised BT data, since it's just there for passive collection. And this is just one of the ways I know to associate devices with companies (that I know WiGLE can collect), so I wanted to be comprehensive in the suggested coverage. But I admit I don't know the extra overhead for each new data type. Can you say more about that? (Because it could then affect the other advertised data types I will suggest collecting.)

[1] E.g. there was a DEF CON talk about Mobile Point of Sale terminals analyzed in Argentina, and when I went and looked for their names based on a regex in the WiGLE data, they were only seen there. Or I've seen the same with some Korean MPOS appearing only in Korea. Or, if you go search for "BLE003U" in WiGLE, there's some device (that I frustratingly haven't identified yet!) that I suspect is associated with Maryland lottery, and so you can see a whole bunch of them in Maryland, and they literally stop at the Maryland border and don't cross over into DC or Virginia. But then they start popping up again in Pennsylvania and New York, so I expect they're probably used by their lotteries as well. The point being that part of the value of WiGLE to researchers is what it can show beyond what they themselves have ever seen, which is why extra collection could be very impactful.

rksh commented 10 months ago

We should be up-front about our appetite for bluetooth data: MAC addresses (mostly randomized) are "in" scope, and vendor/manufacturer information isn't currently considered problematic, but data that CAN serve as a more specific/unique fingerprint or compromise individual privacy is outside the scope of the project and will remain so.

It may also be reasonable to facilitate local collection of this information in the client, but server-side aggregation is currently considered outside the scope of WiGLE's mission to educate and help secure.

XenoKovah commented 10 months ago

Duly noted. I don't think stuff I've requested yet falls in that category. (BTW you know that the BDADDR/MAC address is unrandomized on all BT classic devices, and some BLE devices, the same as with WiFi, and consequently that forms a unique global signature in and of itself, right?)

rksh commented 10 months ago

Does being this condescending usually get you what you want?

On Fri, Sep 8, 2023 at 16:51 XenoKovah wrote:

Duly noted. I don't think stuff I've requested yet falls in that category. (BTW you know that the BDADDR/MAC address is unrandomized on all BT classic devices, and some BLE devices, the same as with WiFi, and consequently that forms a unique global signature in and of itself, right?)

XenoKovah commented 10 months ago

I wasn't being condescending. I suspect this is one of those "tone doesn't come through in text" sort of things.

I don't want to make any assumptions about what's currently known about BT since WiFi has been the emphasis. And it was specifically the "MAC addresses (mostly randomized)" statement that threw me off (specifically the parenthetical statement). Since they're not randomized at all for BT classic nor BT public static. But you seemed to be saying that "data that CAN serve as a more specific/unique fingerprint" was out of scope, but then saying that BDADDRs were in scope, which seemed like a contradiction. Unless you meant that you had only been meaning to collect the randomized BLE addresses thus far, in which case the intention was not currently occurring. So I was trying to make sure we were on the same page about that, before getting into any concerns you had about the new data that's being requested.

(The first sentence was a simple acknowledgement that I wouldn't be requesting anything I thought contravened that guidance. The second sentence was rather a request to contradict me if I had misunderstood and the existing data request had already gone beyond the guidance. But I posted it first from an anonymous account and then forgot to copy it before I deleted it and signed back in here, and I forgot to put a question mark at the end of the second sentence when I re-wrote it, so maybe that's partially the cause of the misinterpretation?)
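To make the address distinction in this exchange concrete, here is a minimal sketch of how a BLE address could be bucketed by type. It assumes the capture already says whether the address is public or random (that flag is not recoverable from the address bytes alone), and that the address string is written most significant byte first. For random addresses, the two most significant bits select the subtype. The function names are hypothetical.

```python
def classify_ble_address(addr: str, is_random: bool) -> str:
    """Classify a BLE device address given as 'AA:BB:CC:DD:EE:FF'."""
    if not is_random:
        return "public"
    top_two_bits = int(addr.split(":")[0], 16) >> 6
    return {
        0b11: "random static",
        0b01: "resolvable private",
        0b00: "non-resolvable private",
    }.get(top_two_bits, "reserved")


def is_stable_identifier(addr: str, is_random: bool) -> bool:
    """True for address types that are not expected to rotate over time."""
    return classify_ble_address(addr, is_random) in ("public", "random static")
```

Under these assumptions, public and random static addresses are the stable ones, while the two private subtypes are the rotating ones.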

bobzilladev commented 10 months ago

Taking this as an opportunity to do some long-needed overall backlog grooming to hopefully get folks on the same page.

This data is straying into an area where it can aid in fingerprinting otherwise well-behaving address-rotating devices, so the project is not going to pursue centrally aggregating it. It may be interesting for research and site-surveying, so this ticket will be left open in case someone in the community would like to take on adding local display/collection of this data to the app.