Support more possible delimiter and quote characters

yehoshuadimarsky / bcpandas

High-level wrapper around BCP for high performance data transfers between pandas and SQL Server. No knowledge of BCP required!!

MIT License

125 stars 44 forks source link

Support more possible delimiter and quote characters #12

Open owen-synergy opened 4 years ago

owen-synergy commented 4 years ago

Hi Josh, I'm not sure if this is possible, but can BCP and Pandas to_csv support more delimiter and quote characters than just the ones you've specified so far? I'm possibly trying to push the limits of bcpandas too far by writing DataFrames with big json blob elements (I know you don't recommend using it with messy text data) but I didn't know if there was still potential to improve it in that regard.

Happy to try creating a PR if you think there's still potential improvements here.

yehoshuadimarsky commented 4 years ago

Fair point. Technically, I think every single character is fair game to be used as a delimiter or quote character, as long as it doesn't appear in the actual data. We can (and do) check for this in constants.py. I only specified those specific options because they are the most common ones. So we could change it to allow for all characters in ASCII 32-127. I was just unsure how this would actually work in real life as it is a bit unorthodox, and I've never tried it. But we can try it.

owen-synergy commented 4 years ago

I'll have to double check the BCP docs again... I think I remember it having to be a visible character or a few others (maybe that is the 32 - 127 range...). Using other characters for quotes does seem a bit funny to me too... I didn't see anything in the Pandas documentation as what characters it would allow for that.

I'll see if I can find some time to experiment. I was going to try and extract the set of all characters that appear in the DataFrame to check if there's two left over that can be used.

yehoshuadimarsky commented 4 years ago

Taking a stab now at working through it... Here's the link for Microsoft docs about terminators https://docs.microsoft.com/en-us/sql/relational-databases/import-export/specify-field-and-row-terminators-sql-server

owen-synergy commented 4 years ago

Nice, I made this code snippet to grab all characters from all strings in a DataFrame, not sure if there's a more efficient way...

all_chars = set()
for col in df.columns:
    try:
        all_chars.update(set(df[col].str.cat(sep="")))
    except AttributeError:
        pass

yehoshuadimarsky commented 4 years ago

There is a better way - it's easier to just go through all the possible valid ASCII values, and find one that it's not in the DataFrame using the same code we currently use

https://github.com/yehoshuadimarsky/bcpandas/blob/d6b16141e40bbabc28ea12ed21ed05eaaf999c51/bcpandas/constants.py#L64-L68

(learned how to do this here)

puruzio commented 3 years ago

Any chance to add | (pipeline char) as the delimiter? That seems to be less likely to appear in the actual data, and somewhat commonly used as a delimiter.

yehoshuadimarsky commented 3 years ago

It already is one of the delimiter options https://github.com/yehoshuadimarsky/bcpandas/blob/db37a2c1ffba7f19a01019b833ba526622ae577c/bcpandas/constants.py#L38