yalenazca / google-refine

Automatically exported from code.google.com/p/google-refine

test expression for collapse columns #287

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Wide, sparse tables are awkward to explore because most columns are empty 
(or contain just one item).  You have to keep scrolling right and left to see 
informative items. See, e.g.
https://spreadsheets.google.com/pub?key=0AnZb5H7tDMvTdEx2eExqc0NRaURjT2djSDdLanZOWmc&hl=en&gid=0

2. How about expanding the "collapse column" options to include an option for a 
GREL test expression evaluated across all the column items in the selected 
rows?  If the expression evaluates to true, the column is collapsed.  If you 
didn't want to add this to the right/left collapse options for the individual 
columns, just having it in the all dropdown menu would be very handy.  With the 
latter setup, the test expression would be applied to all the columns of the 
visible rows.
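
As a rough illustration of the idea in item 2, here is a minimal sketch in Python rather than GREL. It is not Refine's code or API: the function `columns_to_collapse`, the two example tests, and the sample rows are all hypothetical, and it assumes the test receives the list of values a column holds in the visible rows and returns true when the column should be collapsed.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[str]]


def columns_to_collapse(
    visible_rows: List[Row],
    column_names: List[str],
    test: Callable[[List[Optional[str]]], bool],
) -> List[str]:
    """Return the columns whose values in the visible rows satisfy `test`."""
    collapsed = []
    for name in column_names:
        # Gather this column's values across the visible rows only.
        values = [row.get(name) for row in visible_rows]
        if test(values):
            collapsed.append(name)
    return collapsed


def all_blank(values: List[Optional[str]]) -> bool:
    """True when every value is missing or whitespace-only."""
    return all(v is None or v.strip() == "" for v in values)


def at_most_one_item(values: List[Optional[str]]) -> bool:
    """True when the column is empty or contains just one non-blank item."""
    non_blank = [v for v in values if v is not None and v.strip() != ""]
    return len(non_blank) <= 1


# Hypothetical sample data standing in for the visible rows of a sparse table.
rows = [
    {"company": "A", "subscribers": "10", "note": None, "source": ""},
    {"company": "B", "subscribers": None, "note": None, "source": ""},
    {"company": "C", "subscribers": "30", "note": "est.", "source": ""},
]
names = ["company", "subscribers", "note", "source"]

print(columns_to_collapse(rows, names, all_blank))         # ['source']
print(columns_to_collapse(rows, names, at_most_one_item))  # ['note', 'source']
```

Passing the test in as a function mirrors the request for a user-supplied expression: the same loop handles "all blank", "at most one item", or any other condition on a column's visible values.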


Original issue reported on code.google.com by galbith...@galbithink.org on 15 Dec 2010 at 6:06

GoogleCodeExporter commented 9 years ago

Original comment by tfmorris on 15 Dec 2010 at 6:36

GoogleCodeExporter commented 9 years ago
Could you not apply multiple facets to ease exploration? Even using a 
scatterplot facet on those few numeric columns you have would probably be 
useful.  Try exploring more with facets and let us know if it misses the point 
somewhere.

Original comment by thadguidry on 15 Dec 2010 at 3:05

GoogleCodeExporter commented 9 years ago
Maybe I'm not understanding, but don't multiple facets just change which rows 
are visible?  With my 1920x1086 resolution screen, I can see a maximum of 14 
(uncollapsed) columns. Suppose I have a table with many more columns than 14, 
with interesting facets containing many columns that have just blank or 
otherwise uniform contents.  In short, suppose I have a very badly designed 
table from a traditional data modeling perspective.  The proposed enhancement 
is meant to allow the user to focus on interesting (varying) data in such bad 
tables.

Google Refine is great for cleaning up messy data-item tables.  Badly 
structured tables may be less common.  But, as mentioned previously, my 
approach to collecting and compiling human-generated data creates "bad" tables.  
See related discussion at Issue 286:
http://code.google.com/p/google-refine/issues/detail?id=286&start=100

Original comment by galbith...@galbithink.org on 16 Dec 2010 at 3:22

GoogleCodeExporter commented 9 years ago
Some related thoughts:
(from 
http://purplemotes.net/2010/12/19/badly-structured-tables-have-a-bright-future/
See there for post with embedded links)

badly structured tables have a bright future

Which is better: one big table, or two or more smaller tables?  The 
organization of the data sources, the number of smaller tables, the extent of 
the relationships between the smaller tables, and economies in table processing 
all affect the balance of advantage.  But cheaper storage, cheaper computing 
power, and fancier data tools probably favor the unified table.  At the limit 
of costless storage, costless processing, and tools that make huge masses of 
data transparent, you can handle a component of the data as easily as you can 
handle all the data.  Hence in those circumstances, using one big table is the 
dominant strategy.[*]

Unified tables are likely to be badly structured from a traditional data 
modeling perspective.  With n disjoint components, the unified table has the 
form of a diagonal matrix of tables, where the diagonal elements are the 
disjoint components and the off-diagonal elements are empty matrices.  It's a 
huge waste of space.  But for the magnitudes of data that humans generate and 
curate by hand, storage costs are so small as to be irrelevant.   Organization, 
in contrast, is always a burden to action.  The simpler the organization, the 
greater the possibilities for decentralized, easily initiated action.
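
To make the block-diagonal picture concrete, here is a small Python sketch under stated assumptions: the helper names `unify` and `fill_fraction` and the three toy component tables are hypothetical, and the components are assumed to share no columns. Each row of the unified table carries values only in its own component's columns, and the rest are empty.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def unify(components: List[List[Row]]) -> List[Row]:
    """Stack component tables into one table over the union of their columns."""
    all_columns: List[str] = []
    for table in components:
        for row in table:
            for name in row:
                if name not in all_columns:
                    all_columns.append(name)
    unified = []
    for table in components:
        for row in table:
            # Columns belonging to other components are filled with None.
            unified.append({name: row.get(name) for name in all_columns})
    return unified


def fill_fraction(table: List[Row]) -> float:
    """Share of cells in the table that hold a value."""
    cells = [value for row in table for value in row.values()]
    return sum(value is not None for value in cells) / len(cells)


# Three hypothetical disjoint components, each 2 rows by 2 columns.
table_a = [{"a1": "x", "a2": "y"}, {"a1": "z", "a2": "w"}]
table_b = [{"b1": "p", "b2": "q"}, {"b1": "r", "b2": "s"}]
table_c = [{"c1": "m", "c2": "n"}, {"c1": "o", "c2": "k"}]

big = unify([table_a, table_b, table_c])
print(len(big), len(big[0]))  # 6 6
print(fill_fraction(big))     # about 0.333: one cell in three is filled
```

With n equally sized disjoint components, only 1/n of the unified table's cells are filled, which is the wasted space the paragraph above describes.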

Consider collecting data from company reports to investors.  Such data appear 
within text of reports, in tables embedded within text, and (sometimes) in 
spreadsheet files posted with presentations.  Here are some textual data from 
AT&T's 3Q 2010 report:

    More than 8 million postpaid integrated devices were activated in the third quarter, the most quarterly activations ever. More than 80 percent of postpaid sales were integrated devices.

These data don't have a nice, regular, tabular form.  If you combine them 
with data from the accompanying spreadsheets, the resulting table isn't pretty.  
It gets even more badly structured when you add human-generated data from 
additional companies.

Humans typically generate idiosyncratic data presentations.  More powerful data 
tools allow persons to create a greater number and variety of idiosyncratic 
data presentations from well-structured, well-defined datasets.   One might 
hope that norms of credibility evolve to encourage data presenters to release 
the underlying, machine-queryable dataset along with the idiosyncratic 
human-generated presentation.  But you can think of many reasons why that often 
won't happen.

Broadly collecting and organizing human-generated data tends to produce badly 
structured tables.  No two persons generate exactly the same categories and 
items of data.  The data that persons present change over time.   The result is a wide 
variety of small data items and tables. Combining that data into one badly 
structured table makes for more efficient querying and analysis.   As painful 
as this situation might be for thoughtful data modelers, badly structured 
tables have a bright future.

*  *  *  *  *

[*] Of course the real world is finite.  A method with marginal cost that 
increases linearly with job size pushes against a finite world much sooner than 
a method with constant marginal cost.   The above thought experiment is meant 
to offer insight, not a proof of a real-world universal law.
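
The footnote's comparison can be sketched with a short calculation. The symbols are assumptions introduced here for illustration, not taken from the post: $N$ is the job size, $c$ a constant marginal cost, $a\,n$ a marginal cost that grows linearly with size $n$, and $B$ a finite budget.

```latex
% Total cost is the accumulated marginal cost up to job size N.
\begin{align*}
  C_{\text{const}}(N)  &= \int_0^N c \, dn    = cN, \\
  C_{\text{linear}}(N) &= \int_0^N a\,n \, dn = \tfrac{1}{2} a N^2 .
\end{align*}
% Largest job that fits within a finite budget B:
\begin{align*}
  cN \le B                 &\;\Longrightarrow\; N \le \frac{B}{c}, \\
  \tfrac{1}{2} a N^2 \le B &\;\Longrightarrow\; N \le \sqrt{\frac{2B}{a}} .
\end{align*}
```

Under these assumptions the feasible job size grows in proportion to $B$ in the constant case but only in proportion to $\sqrt{B}$ in the linear case, so for large budgets the second method reaches its limit at a much smaller job size.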

Original comment by galbith...@galbithink.org on 19 Dec 2010 at 7:09