xtensor-stack / xtensor

C++ tensors with broadcasting and lazy computing
BSD 3-Clause "New" or "Revised" License
3.36k stars 398 forks source link

different outcome of `xt::argsort` on different environments when data contains ties #2677

Closed ThibHlln closed 1 year ago

ThibHlln commented 1 year ago

Hi,

There seems to be some inconsistency when using xt::argsort on data containing ties depending on the environment you work on.

For example, with this dataset (data.csv):

#include <fstream>
#include <xtensor/xcsv.hpp>
#include <xtensor/xsort.hpp>
#include <xtensor/xio.hpp>

std::ifstream ifs;
ifs.open("data.csv");
xt::xtensor<double, 2> d = xt::load_csv<double>(ifs);
ifs.close();

On macOS with Clang:

std::cout << xt::argsort(d, 1) << std::endl;
// {{239, 238, 237, 236, 234, 233, 235, 259, 258, 222, 260, 257, 254, 223, 256,
// ...

On Windows with MSVC:

std::cout << xt::argsort(d, 1) << std::endl;
// {{239, 238, 237, 236, 234, 235, 259, 233, 222, 258, 257, 260, 223, 254, 256,
// ...

Also, another inconvenience is that the xt::argsort algorithm does not seem to match any of the numpy.argsort algorithms, at least when ties are involved:

import numpy as np

d = np.genfromtxt("data.csv", delimiter=',', dtype=float)[np.newaxis, :]
print(np.argsort(d, axis=1, kind='quicksort'))
# [[239 238 237 236 234 233 235 259 258 222 257 260 256 223 254
# ...
print(np.argsort(d, axis=1, kind='mergesort'))
# [[239 238 237 236 234 233 235 259 222 258 257 260 223 254 256
# ...
print(np.argsort(d, axis=1, kind='heapsort'))
# [[239 238 237 236 234 259 235 233 258 222 260 257 223 254 256
# ...
tdegeus commented 1 year ago

Thanks. Look worrisome. Could you post some small data to reproduce?

ThibHlln commented 1 year ago

Hi @tdegeus

I tried to generate a smaller series to reproduce the problem but did not succeed, this is why I provided the CSV file in my first post, on which I know there is a problem (on my side anyway).

tdegeus commented 1 year ago

Understood. To help debugging could you:

Check that the output is sorted

auto sorter = xt::argsort(d, 1);
for (size_t i = 1; i < sorter.shape(1); ++i) {
    if (d(sorter(0, i)) < d(sorter(0, i - 1))) {
        throw std::runtime_assertion("not sorted");
    }
}

Replace the csv load with

xt::xtensor<double, 2> d = {{2247.,2044.,2037.,1825.,1699.,1600.,1501.,1432.,1440.,1388.,1299.,1259.,1211.,1177.,1124.,1121.,1399.,1179.,1102.,1055.,1017.,1001.,979.,953.,925.,927.,1899.,4782.,2601.,3050.,1998.,3478.,5762.,8745.,7777.,4086.,2968.,2456.,2138.,3199.,7187.,7165.,3942.,11588.,4643.,3618.,3184.,6052.,3723.,3356.,2645.,2330.,2103.,1928.,1890.,3624.,3647.,5821.,2949.,4161.,3855.,5200.,3162.,3896.,19818.,5228.,3711.,4874.,4868.,4267.,2978.,2748.,2500.,2276.,2107.,1969.,1815.,1714.,1649.,1561.,1491.,1428.,1347.,1288.,1245.,1207.,1163.,1189.,1558.,1313.,1186.,1147.,1143.,1099.,1036.,1008.,982.,953.,924.,899.,880.,858.,836.,818.,791.,762.,742.,720.,701.,685.,674.,658.,643.,625.,609.,602.,594.,594.,625.,616.,707.,759.,929.,755.,684.,737.,804.,680.,639.,620.,706.,2474.,1406.,2686.,2037.,1410.,1130.,992.,901.,847.,798.,766.,750.,2793.,1574.,1087.,960.,883.,823.,782.,755.,730.,705.,677.,744.,932.,770.,794.,804.,728.,721.,765.,826.,795.,784.,875.,711.,795.,2295.,1265.,1115.,936.,817.,742.,713.,680.,647.,647.,671.,651.,692.,605.,576.,612.,564.,536.,526.,516.,506.,494.,473.,459.,451.,450.,441.,428.,430.,453.,467.,444.,413.,390.,376.,373.,368.,360.,359.,351.,348.,345.,343.,340.,337.,341.,529.,476.,448.,397.,367.,354.,342.,332.,320.,324.,327.,332.,340.,359.,331.,382.,371.,356.,333.,318.,316.,318.,312.,308.,300.,298.,399.,448.,669.,534.,402.,384.,361.,347.,337.,342.,349.,339.,333.,325.,324.,326.,324.,322.,320.,318.,322.,327.,327.,333.,329.,333.,353.,436.,516.,412.,402.,398.,551.,449.,416.,484.,801.,537.,457.,483.,534.,477.,483.,582.,575.,557.,504.,477.,515.,643.,627.,627.,546.,515.,497.,577.,1439.,1198.,1591.,1815.,1448.,1117.,960.,847.,809.,788.,749.,896.,870.,801.,752.}};

And also check for

xt::xtensor<double, 1> d = {2247.,2044.,2037.,1825.,1699.,1600.,1501.,1432.,1440.,1388.,1299.,1259.,1211.,1177.,1124.,1121.,1399.,1179.,1102.,1055.,1017.,1001.,979.,953.,925.,927.,1899.,4782.,2601.,3050.,1998.,3478.,5762.,8745.,7777.,4086.,2968.,2456.,2138.,3199.,7187.,7165.,3942.,11588.,4643.,3618.,3184.,6052.,3723.,3356.,2645.,2330.,2103.,1928.,1890.,3624.,3647.,5821.,2949.,4161.,3855.,5200.,3162.,3896.,19818.,5228.,3711.,4874.,4868.,4267.,2978.,2748.,2500.,2276.,2107.,1969.,1815.,1714.,1649.,1561.,1491.,1428.,1347.,1288.,1245.,1207.,1163.,1189.,1558.,1313.,1186.,1147.,1143.,1099.,1036.,1008.,982.,953.,924.,899.,880.,858.,836.,818.,791.,762.,742.,720.,701.,685.,674.,658.,643.,625.,609.,602.,594.,594.,625.,616.,707.,759.,929.,755.,684.,737.,804.,680.,639.,620.,706.,2474.,1406.,2686.,2037.,1410.,1130.,992.,901.,847.,798.,766.,750.,2793.,1574.,1087.,960.,883.,823.,782.,755.,730.,705.,677.,744.,932.,770.,794.,804.,728.,721.,765.,826.,795.,784.,875.,711.,795.,2295.,1265.,1115.,936.,817.,742.,713.,680.,647.,647.,671.,651.,692.,605.,576.,612.,564.,536.,526.,516.,506.,494.,473.,459.,451.,450.,441.,428.,430.,453.,467.,444.,413.,390.,376.,373.,368.,360.,359.,351.,348.,345.,343.,340.,337.,341.,529.,476.,448.,397.,367.,354.,342.,332.,320.,324.,327.,332.,340.,359.,331.,382.,371.,356.,333.,318.,316.,318.,312.,308.,300.,298.,399.,448.,669.,534.,402.,384.,361.,347.,337.,342.,349.,339.,333.,325.,324.,326.,324.,322.,320.,318.,322.,327.,327.,333.,329.,333.,353.,436.,516.,412.,402.,398.,551.,449.,416.,484.,801.,537.,457.,483.,534.,477.,483.,582.,575.,557.,504.,477.,515.,643.,627.,627.,546.,515.,497.,577.,1439.,1198.,1591.,1815.,1448.,1117.,960.,847.,809.,788.,749.,896.,870.,801.,752.};
ThibHlln commented 1 year ago

Thanks @tdegeus

So, yes the output is sorted.

And I tried replacing the CSV load with your lines (both 2D and 1D), and the result remain different between macOS+clang and Windows+MSVC.

tdegeus commented 1 year ago

I did not get that you have repetitions. In that case it is not surprising that you don't get the same result. The implementation of the quicksort algorithm is likely not the same for all compilers. NumPy might be using its own algorithm that forces the same result on different platforms.

ThibHlln commented 1 year ago

Oh I see, you are right:

The order of equal elements is not guaranteed to be preserved. https://en.cppreference.com/w/cpp/algorithm/sort

I guess there is no way of telling xtensor to use stable_sort as sorting algorithm (when one is ready to pay the price of slower computation to get the guarantee that the order of equal elements is preserved)?

tdegeus commented 1 year ago

Not at the moment no.

ThibHlln commented 1 year ago

Thanks @tdegeus, I have had a go at making it possible now, proposal in #2681.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity. It will be automatically closed in 14 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.