Import large dataset
-
New data on imports and exports turns map of carbon emission on its head
[Guardian] (Environment news, comment and analysis from the Guardian | guardian.co.uk)
Study shows massive rise in the carbon footprint of traded goods, with the rich world importing far more than it exports
• Carbon cuts by developed countries cancelled out by imports
• FAQ: Which nations are most responsible for climate change?
• FAQ: What are 'outsourced emissions'?
• Get the data
Earlier this week I reported on some groundbreaking research on the carbon embodied in global trade flows. Of course, we've known for years that the carbon footprint of goods imported from emerging economies such as China was counteracting some of the apparent emissions cuts made within developed countries – in other words, that the rich world has been 'offshoring' or 'outsourcing' its emissions.
But the new data allows us for the first time to get a global view of emissions flows going right back to 1990, the baseline year for the Kyoto Protocol. The headline conclusions are startling: while official figures suggest that CO2 emissions within developed nations have fallen by 2% or 258 million tonnes (MT), once imported goods are added and exported goods are subtracted the true change is an increase of nearly 7% or 1607 MT.
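The adjustment described above is simple arithmetic: a consumption-based footprint is the territorial figure plus the carbon embodied in imports, minus the carbon embodied in exports. A minimal Python sketch, using hypothetical function names and only the two headline changes quoted above, recovers the implied change in net imported emissions for developed nations:

```python
def consumption_footprint(territorial_mt, imported_mt, exported_mt):
    """Consumption-based footprint in MT of CO2: territorial emissions
    plus carbon embodied in imports, minus carbon embodied in exports."""
    return territorial_mt + imported_mt - exported_mt


def implied_net_import_change(territorial_change_mt, consumption_change_mt):
    """Change in net imported emissions implied by the two headline changes."""
    return consumption_change_mt - territorial_change_mt


# Developed nations since 1990, as quoted in the article:
# territorial emissions fell by 258 MT; the trade-adjusted total rose by 1607 MT.
print(implied_net_import_change(-258, 1607))  # 1865
```

In other words, the two figures together imply that net imported emissions grew by roughly 1865 MT over the period.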
Admittedly, much of the rise is accounted for by the US, which dropped out of the Kyoto Protocol before ratification. But even those nations which did ratify – and which have cut 1326 MT of CO2 since 1990 – look much less rosy in the context of 1128 MT of net imports in the form of goods and food. For some countries the picture is even starker, with the UK's 5% (28 MT) reduction becoming a rise of 16% (102 MT) when trade is taken into account.
Although the overall picture is of a massive transfer of carbon from the poor world to the rich world, many countries buck the trend. After China, with its massive net exports of 1329 MT of CO2, the next biggest carbon exporter is Russia with 281 MT. Other net exporters in the developed world include Ukraine, Australia and Poland, whose national footprints drop 18%, 16% and 14% respectively when imports and exports are factored into the equation.
Similarly, although the very biggest carbon importers – such as the US, Germany, Japan, the UK, France and Italy – are all in the developed world, a large number of developing countries, from Mexico and Paraguay to Tanzania, import more carbon than they export. Mozambique's footprint rises by a remarkable 172% when trade is taken into account.
The data also highlights the significance of trade in different types of products. Here are the top five sources of imported carbon overall:
- Machinery and equipment from China. 144 MT
- Metals from Russia. 115 MT
- Chemical, rubber and plastic products from the US. 108 MT
- Chemical, rubber and plastic products from China. 107 MT
- Electronic equipment from China. 99.6 MT
And some example bilateral flows:
- Machinery and equipment from China to the US. 36.3 MT
- Gas from Canada to the US. 27.4 MT
- Metals from Russia to Germany. 18.8 MT
- Motor vehicles and parts from Japan to the US. 13.0 MT
- Chemical, rubber and plastic products from China to Japan. 11.6 MT
One unexpected finding in the research is that the majority of the emissions embodied in international trade flows now take the form of 'non-energy-intensive' goods – anything from toys and computers to cars – rather than 'energy-intensive' goods such as steel and other raw materials. This suggests that it may be harder than previously thought to monitor carbon in trade flows on an ongoing basis. After all, it's far more difficult to estimate the carbon footprint of a shipload of consumer goods than it is of a shipload of steel.
The importance of non-energy-intensive goods also suggests that some of the solutions that have occasionally been proposed for stemming offshored emissions – such as border taxes on carbon-intensive raw materials – would be unlikely to work, and may even have the opposite of the intended effect by boosting international trade in manufactured goods.
When I asked Decc about the new data, Energy and Climate Change Secretary Chris Huhne gave the following statement (too late for the original news story):
"I am proud of the UK's leadership in reducing emissions within our own borders and in promoting a global deal on climate change to measure and reduce emissions across the entire world. This is the best way to reduce emissions from imported goods."
The problem, of course, is that until the world starts to understand and recognise the importance of carbon in traded goods, it may be harder to actually achieve a global deal, as countries like China will legitimately argue not only that they have produced a relatively small amount of carbon per person over the last hundred years, but also that their rising emissions are significantly driven by producing goods for richer nations. That's why this new global dataset is so important.
Lead researcher Glen Peters claims that, despite the challenges, meaningful levels of monitoring for carbon in trade flows could be achieved, and that this should help the world reach a strong global deal on emissions reductions. "We argue an important first step is for countries to regularly report emissions from the production of internationally traded products and that including this data into climate negotiations may help facilitate a more robust agreement."
I've made the whole dataset available in this Google Spreadsheet. Sheets 5, 6 and 7 contain the core data, and I've added sheet number 12 which draws together what I think are the key figures for each country.
guardian.co.uk © Guardian News & Media Limited 2011 | Use of this content is subject to our Terms & Conditions | More Feeds -
Dabo 0.9.3 - 3-tier desktop app framework for Python.. (Free)
[Macintosh] (MacUpdate: Recent Mac OS X)
Dabo is a 3-tier, cross-platform application development framework, written in Python atop the wxPython GUI toolkit. And while Dabo is designed to create database-centric apps, that is not a requirement. Lots of people are using Dabo for the GUI tools to create apps that have no need to connect to a database at all.
Desktop applications. That's what Dabo does. It's not YAWF (yet another web framework). There are plenty of excellent web frameworks out there, so if that's what you are looking for, Dabo isn't for you. But there are almost no desktop application frameworks out there, and if you want to create applications that run on Windows, OS X or Linux, Dabo is for you!
Version 0.9.3:
- Cleaned up the code base by removing all trailing whitespace.
- Removed the reference to md5 and replaced it with hashlib. hashlib is already being used in dDataSet.py, and md5 has been deprecated (users of Python 2.4 or earlier need to install hashlib.py).
- Added dabo.lib.reportUtils.printPDF(), and added a print button to FrmReportBase in the AppWizard generated code.
- On Windows, where the status text can go to the main frame, hidden forms were writing their current record text to the status bar on update. This fixes that awkwardness.
- Eliminated (or at least greatly reduced) the grid header flickering on Windows.
- Fixed a bug reported by Mark Rajcok in which the WordWrap property of grid columns was only affecting those with str DataType, and was ignoring unicode DataType.
- Switched all of Dabo's logging to use standard Python logging.
- Switched from os.system() and os.popen2() to subprocess.call() in previewPDF() and printPDF(). Removes DeprecationWarnings in Python 2.6 and above.
- Added some logic to prevent infinite loops when using field-level validation.
- Added a fix so that previewing and printing should be modeless by default.
- Updated the internal cache code to handle permissions better when running under mod_wsgi.
- Added the 'SelectedText' and 'Text' read-only properties to dHtmlBox.
- Added the reports folder to the resolvePathAndUpdate method in the utils module. Added a conditional check in the reportWriter to call the resolvePathAndUpdate method if the path is a valid absolute path. So, now you can specify the ReportFormFile property of the ReportWriter object as a relative path.
- Reverts a behavioral change introduced accidentally in r5846. Now scan() will requery child bizobjs by default.
- Created the 'ustr()' method in dabo.lib.utils. This is designed to replace all of our calls to str() in order to eliminate the unicode encoding errors that pop up frequently when non-American developers use Dabo. To use, include the line:
from dabo.lib.utils import ustr
in your import statements, and then replace all instances of str(val) with ustr(val).
- Added some sizer outline code to dFormMixin that was inadvertently left out of the last Web Update.
- dRichTextBox: Changed the 'InsertionPoint' property to 'InsertionPosition' to be consistent with other editing controls.
- dRichTextBox: Renamed the 'loadFromFile()' and 'saveToFile()' methods to 'load()' and 'save()', respectively, as they can use any file-like object.
- Added the dRichTextBox class, which allows for basic rich text editing and display.
- Fixed a bug in Web Update that prevented apps from recognizing that Web Update had not yet been run for that Dabo installation.
- Changed the PreferenceDialog to use a basic dPageFrame instead of dPageList, due to wx warnings.
- Fixed a potential problem in list controls where the control could try to access its value before the correct DataSource had been set.
- Added code to reduce dGrid flickering under Windows.
- Added the ImageRenderer class to display images in grid cells.
- The code that handles dropped text/files now preserves the x,y location of the drop so that you can tell where on the control the user dropped.
- Added a textbox-level NoneDisplay property so that individual text box controls can determine how they display None values instead of all controls using dApp.NoneDisplay.
- Fixed an incompatibility with a recent wxPython change to the foldpanelbar class.
- Several visual improvements to the dPageStyled class.
- Fixed a bug in dSlider that prevented reversed display.
- Fixed some inheritance issues with sizers.
- Improved sizer outlining to be more flexible. This is mostly for app design uses.
- Added support for the dPageStyled class to the Class Designer.
- Added visual indication of sizers in the Class Designer when a sizer is the selected object.
- Added the HomeDirectoryStatusBar to the visual tools to display your current app's HomeDirectory.
- Updated the standard directory structure for Dabo apps to include 'cache' and 'lib' directories.
- Fixed some import errors in reportWriter.py
- Added an optimization to dDataSet that improves performance when doing multiple queries against the same data.
- Added an optional optimization to dBizobj that avoids having to requery child bizobjs during a scan().
- Fixed a bug in the JsonConverter imports.
- Corrected the bizobj isChanged() function to reflect new records as well as modified records.
- Added dApp.AutoImportConnections property. When False, dApp will skip the process of finding and loading dConnectInfo objects from found cnxml files. My app is being enhanced to use cnxml files, but I use my own logic on which cnxml file to use.
- Augmented biz.getTempCursor() with some optional arguments, allowing the appdev to set sql, params, and have automatic requerying before returning the cursor reference.
- dSlider: added the Reversed and TickPosition properties, as requested by Mike Mabey.
- Finally fixed a dMenuItem bug that's been bothering me for (I think) years now. On Windows, sometimes there would be double captions, and the "Close Window" item in the File menu would be corrupted. Apparently, the timing of calling SetBitmap() is crucial: it must happen before SetText().
- Fixed the language problems with the code to find menus in AppWizard applications.
- dDateTextBox: allows the user to clear the date (set to None) by adding a shortcut ('N')
- dReportWriter: Fixed bug with spanning objects: if the group didn't print for whatever reason, the spanning never started. Therefore, we can't try to draw the object.
- Fixed a bug in dGrid's incremental search on Windows.
- Refactored the 'resolvePathAndUpdate()' method into dabo.lib.utils instead of dabo.ui.uiwx
- Added window scrolling events to dGrid and dScrollPanel, so that your code can now handle them if needed.
- Added optional argument to cur.execute() to convert any ?'s to the backend's paramPlaceholder.
- Made the localization installation process a little more sane, as it seems that it is especially prone to errors. Now, instead of abending when the dabo localization file isn't found, it prints an error and continues. The app will continue to work fine, but no translations will be done.
- Fixed a bug that prevented boolean values in grid columns from being properly restored. Trac issue #1247.
- Added a flag to avoid an unnecessary pointer movement caused by setting the DataSource of the grid to a bizobj after the bizobj had already been created and had its record pointer set. Trac issue #1314.
- The logic for constructing the filtering WHERE clause in child bizobjs has been corrected to be fully parameterized, instead of 'injecting' the value directly into the SQL.
- Fixed a problem when re-opening designs for custom classes. Reported by Martinecz Miklós.
- Added the 'GridCellEditEnd' event, which is raised when a grid cell editor is hidden, whether the value has been changed or not.
- Fixed a potential issue in the Class Designer Property Sheet in which you could be editing the value of a property of one control, and then navigate to a different control and have your change accidentally be applied to the second control.
- Fixed the Crypto property so that setting it to a crypto object will result in the encrypt() and decrypt() methods using that object.
- Fixed a bug in the filterByExpression() method that only replaced the first occurrence of a field name in the expression. Reported by Ricardo Aráoz.
- Changed the Face setter to ignore attempts to set it to 'MS Shell Dlg*' font face names, which can happen when a cdxml file created on Windows is opened on Mac or Linux.
- Changed the HomeDirectory setter to write an errorLog entry instead of throwing an exception when an invalid path is passed. Again, this is an issue with moving a cdxml from one system to another.
- Changed the behavior when attempting to set the Face to a non-existent fontface. Instead of throwing an error, an entry is written to the Dabo error log describing the issue.
- Updated dLed to make it data-aware. It can now be bound to a DataSource and DataField, and have its Color reflect the underlying boolean value.
- Revamped the handling of pathing. If you have file path references in your cdxml or cnxml files, this could break your old files. Pathing is now relative to the HomeDirectory of your app, instead of the location of the tool that created the file.
- Updated the internal encryption code to support the use of DES3 cryptography if you have the PyCrypto package installed. There is also a write-only property of dApp called 'CryptoKey': when you set that property, if you have PyCrypto installed, it will create a new SimpleCrypt instance that uses DES3 encryption, with that value as the encryption key. And to avoid having to pass a plain-text value to the app creation, you can set this property at any time, passing either a string or any callable. By passing a callable, you can more effectively hide the way your key is stored; the actual key never has to appear in your code at all. If you have cnxml files created with the old code, they may no longer work. Simply re-run the CxnEditor to re-save them with the new encryption.
- Added the UIAppClass property to dApp. It can be set to a custom subclass of dabo.ui.uiApp, to enable developers to add ui toolkit-specific behaviors.
- Fixed an issue with the cursor's _mementos attribute getting inadvertently altered by record cloning. Trac #1316
- Overrode the removeAll() method of this control to work properly with the underlying wx control. Trac #1308
- Added some code to ensure a minimal wx version.
- Fixed the problem where the cursor won't save a new unchanged record even if the bizobj is set to SaveNewUnchanged.
- Fixed a problem in the seek() method when called from the bizobj when the table has a compound pk. Also improved the algorithm for matching and near-matching. Trac issue #1330.
- Major fix to how resizing is handled for panels. Previously, there were cases where panels would get "stuck" at a large size and were not able to be resized smaller. This should fix that problem. Also added the 'Square' property. When set to True, the panel will confine itself to a square shape.
- If a dBizobj.filter() call filters out all records, a NoRecordsException was being raised. This is generally not necessary, so this is now caught. Trac #1331
- Added code to handle the case in dGrid where a case-insensitive search is being done on a string type column with null values.
- Corrects a problem identified by Jacek Kałucki in which some columns that don't have defined data types only get their type corrected the first time a query is run.
- Fixed a bug that only happened if you called the layout() method of the StatusBar. The attribute '_platformIsWindows' had been previously defined in dPanel, but dStatusBar no longer inherits from that.
- Fixed getCaptureBitmap() to use a WindowDC for panels and dialogs instead of the parent's ClientDC.
- Added the StatusBarClass property, which will allow a form to create any status bar subclass that is needed. Defaults to the standard dStatusBar.
- Fixed a problem when using direct object references as the DataSource, discovered by Jacek Kałucki.
- Added biz.hasPK() method. It answers the question "is this PK value present in the dataset?" It doesn't move the pointer or have any side-effects, and is optimized to return the answer to this question as fast as possible.
- Fixed a bug in cdxml rendering: if it contained a boolean value False, it was stored as the word 'False'. But when the attProperties were being read back in, we were doing bool(val) to convert it back, and bool("False") is True.
- Added a bit of verbosity to quickStart(). Also changed it so that it requires an app name.
- Deprecated ShowColumnLabels; added ShowHeaders. Fixed ShowHeaders and HeaderHeight to not have side-effects on each other.
- Fixed some display issues when the grid has at least one column with Expand=True.
- Made history searches in the Command Window case-insensitive.
- Fixed an issue when opening up the preferences dialog for the first time: since no update frequency preference had been set, an error was raised when the frequency radio list was bound to that pref.
- Changed the order in which we import the app subdirectory modules. This is necessary when using a customized uiApp subclass, which would typically be added as a class in the ui module, and then used by overriding the app creation line: app = App(SourceURL=remotehost, UIAppClass=ui.MyCustomUIClass)
- Fixed a typo that caused an exception when pressing when a report object was selected. Now it brings up the default property in the property sheet like was intended.
- Corrected an error when saving/running a non-sizer-based form.
- Enhanced copy/paste in the report designer: if objects from multiple bands are selected when copied, paste them into the same bands instead of the currently selected band. Allows for copy/pasting a header and a detail item, for example.
- The recent change to make dImage a data-aware control had an unintended effect in the Class Designer: saving an image resulted in not only saving the path, but also the complete byte stream of that image in the Value property. This fixes that oversight.
- Added build scripts for Linux to AppWizard. Now you can 'buildwin', 'buildmac' *and* 'buildlin'.
- Added support for the pudb debugger if it is installed.
- Improved the flow when controls have their value changed from an update() call. Now the flushValue() is no longer called, which avoids unnecessary data validation calls.
- Fixed some issues with establishing HomeDirectory when running from the runtime engine.
- Restored the call to flushValue() that was removed in the recent changes to dSpinner. It was breaking anything that used a data-bound spinner, especially the Class Designer.
- Corrected the issues raised by Jacek Kałucki regarding the navigation problems of dComboBox under Windows.
- Made some changes designed to better support images as data-bound controls. Also allowed for the image's Picture property to accept a wx.Bitmap as the image source, as that's what is returned from the clipboard methods.
- Added the convenience methods dabo.ui.copyToClipboard() and dabo.ui.getFromClipboard(). They work with both text and bitmap types.
- Added a menu option to the Dabo Editor for toggling whitespace visibility.
- Fixed the download_url in setup.py to match the current download location. I guess easy_install has been broken for a while. Thanks Carl Karsten for finding and reporting the problem, along with a solution!
- Fixed setup.py to work with pip, per Carl Karsten.
- Fixed a problem in the SQL generation of AppWizard-generated apps when using full text searches.
- Fixed the issues with custom classes in the Class Designer not being properly inherited when running inside other class designs. The problem came down to pathing, as have most similar issues, so I approached it by adding some smarter pathing code.
- Consolidated the logic for app standard directories. Recent changes had not kept the different locations where they were referenced in sync, so now there is an attribute of dApp called '_standardDirs' that is a tuple of the subdirectory names. There is also now a method of dApp called 'getStandardDirectories()' that will return a tuple containing the HomeDirectory and the full paths to all these subdirectories.
- Enhanced the Editor app by adding an option for setting the number of characters before AutoAutoComplete fires, and another option for toggling whether line numbers are visible.
- Added some events to dReportWriter: ReportCancel, ReportBegin, ReportEnd, and ReportIteration. Your code can bind to them like any other Dabo event.
- Fixed an issue when setting the DataSource to a dPref object.
- Wrapped the creation of the status bar to hopefully not give artifacts when created too close to form creation/resize time.
- Updated dSecurityManager and added dApp.LoginDialogClass.
- Fixed color settings in dLed.
- Added the EditorStyleNeeded event to support custom styling in dEditor.
- Fixed an occasional problem with dObject.__repr__() calls.
- Removed code that was slowing searches on virtual fields.
- Implemented full parameter passing to backend SQL instead of using string substitution.
- Added a lib directory into the standard Dabo structure.
- Fixed datanav Form to not keep the edit page active after a delete or cancel if that action resulted in the RowCount going to 0.
Requirements:
- Mac OS X 10.3 or later.
- Python 2.4 or higher (2.5 recommended).
- wxPython 2.6 or higher (2.8 highly recommended).
- MySQLdb (for MySQL), kinterbasdb (Firebird), pysqlite (SQLite) or psycopg (PostgreSQL) is required for the database-related demos to work.
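The cdxml boolean bug mentioned in the release notes above (a stored 'False' being read back with bool(val)) comes from the fact that any non-empty string is truthy in Python. The sketch below illustrates the pitfall and one possible way around it; parse_bool is a hypothetical helper written for this illustration, not Dabo's actual fix.

```python
def parse_bool(text):
    """Convert the stored text form of a boolean back into a bool.

    Hypothetical helper for illustration only; it accepts 'True' or
    'False' (any case, surrounding whitespace ignored) and rejects
    everything else instead of guessing.
    """
    normalized = text.strip().lower()
    if normalized == "true":
        return True
    if normalized == "false":
        return False
    raise ValueError("not a boolean literal: %r" % (text,))


print(bool("False"))        # True: any non-empty string is truthy
print(parse_bool("False"))  # False
```

Raising on unrecognized input is a design choice for the sketch: a silently wrong property value is usually harder to track down than an error at load time.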
-
Blog Post: Loading data to SQL Azure the fast way
[Careers] (Site Home)
Introduction
Now that you have your database set up in SQL Azure, the next step is to load your data to this database. Your data could exist in various sources; valid sources include SQL Server, Oracle, Excel, Access, flat files and others. Your data could exist in various locations. A location might be a data center, behind a corporate firewall, on a home network, or even in Windows Azure.
There are various data migration tools available, such as the SQL Server BCP Utility, SQL Server Integration Services (SSIS), Import and Export Data and SQL Server Management Studio (SSMS). You could even use the Bulk Copy API to author your own customized data upload application. An example of one such custom data migration application that is based on BCP is the SQL Azure Migration Wizard.
In this blog we discuss
1. Tools you have to maximize data upload speeds to SQL Azure
2. Analysis of results from data upload tests
3. Best Practices drawn from the analysis to help choose the option that works best for you
Choose the right Tool
Here are some popular tools that are commonly used for bulk upload.
BCP: This is a utility available with the SQL command line utilities that is designed for high performing bulk upload to a single SQL Server/Azure database.
SSIS: This is a powerful tool when operating on multiple heterogeneous data sources and destinations. This tool provides support for complex workflow and data transformation between the source and destination.
In some cases it is a good idea to use a hybrid combination of SSIS for workflow and BCP for bulk load to leverage the benefits of both the tools.
Import & Export Data: A simple wizard that does not offer the wide range of configuration that SSIS provides, but is very handy for schema migration and smaller data uploads.
SSMS: This tool has the option of generating SQL Azure schema and data migration scripts. It is very useful for schema migration, but is not recommended for large data uploads.
Bulk Copy API: In the case where you need to build your own tool for maximum flexibility of programming, you could use the Bulk Copy API. This API is highly efficient and provides bulk performance similar to BCP.
Set Up
To standardize this analysis, we have chosen to start with a simple flat-file data source with 1GB of data and 7,999,406 rows.
The destination table was set up with one clustered index. It had a size of 142 bytes per row.
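As a quick sanity check on these figures, the row count multiplied by the per-row size should land close to the stated 1GB. A minimal sketch, assuming the 142 bytes per row applies uniformly to every row (the function name is hypothetical):

```python
def table_size_bytes(rows, bytes_per_row):
    """Approximate table size, assuming every row has the same size."""
    return rows * bytes_per_row


total = table_size_bytes(7_999_406, 142)
print(total)                    # 1135915652
print(round(total / 2**30, 2))  # 1.06
```

So the stated row count and row size are consistent with roughly 1GB of data in the destination table.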
We have focused this analysis on the two distinct scenarios of having data located inside and outside Windows Azure.
After sampling the various tools, we have identified BCP and SSIS as the top two performing tools for this analysis. These tools were used under various scenarios to determine the setup that provides fastest data upload speeds.
When using BCP, we used the -F and -L options to specify the first and last rows of the flat file for the upload. This was useful to avoid having to physically split the data file to achieve multiple-stream upload.
When using SSIS, we split source data into multiple files on the file system. These were then referenced by Flat File Components in the SSIS designer. Each input file was connected to an ADO.NET component that had the "Use Bulk Insert when possible" flag checked.
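The row-range approach used with BCP can be sketched as follows. This is an illustration under stated assumptions, not the scripts used in the tests: the function names, the server, user and table names, and the file name are all placeholders, and only the bcp switches -S, -U, -c, -F, -L and -b are used. Any further format options (such as a field terminator) depend on your data file and are left out.

```python
def stream_ranges(total_rows, streams):
    """Split rows 1..total_rows into contiguous, non-overlapping
    (first, last) ranges, one per stream (1-based, inclusive)."""
    base, extra = divmod(total_rows, streams)
    ranges = []
    first = 1
    for i in range(streams):
        size = base + (1 if i < extra else 0)  # spread the remainder
        last = first + size - 1
        ranges.append((first, last))
        first = last + 1
    return ranges


def bcp_command(table, data_file, server, user, first, last, batch_size=10000):
    """Build one bcp invocation as an argument list; nothing is executed.
    No password switch is included, so bcp would prompt for it."""
    return [
        "bcp", table, "in", data_file,
        "-S", server,           # placeholder server name
        "-U", user,             # placeholder login
        "-c",                   # character-mode data file
        "-F", str(first),       # first row of this stream
        "-L", str(last),        # last row of this stream
        "-b", str(batch_size),  # rows committed per batch
    ]


# Five streams over the 7,999,406-row file described above.
for first, last in stream_ranges(7_999_406, 5):
    cmd = bcp_command("mydb.dbo.LINEITEM", "lineitem.tbl",
                      "myserver.example.net", "loader", first, last)
    print(" ".join(cmd))
```

Each printed command covers a disjoint slice of the same file, so the file never has to be physically split.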
Approach
SQL Azure must be accessed from a local client tool over the Internet. This network has three properties that affect the time required to load data to SQL Azure.
· Latency: The delay introduced by the network in getting the data packets to the server.
· Bandwidth: The capacity of the network connection.
· Reliability: Prone to disconnects due to external systems.
Latency causes an increase in time required to transfer data to SQL Azure. The best way to mitigate this effect is to transfer data using multiple concurrent streams. However, the efficiency of parallelization is capped by the bandwidth of your network.
In this analysis, we have studied the response of SQL Azure to concurrent data streams so as to identify the best practices when loading data to SQL Azure.
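One way to drive several concurrent streams from a single client is to launch each upload command in its own thread and wait for all of them. The sketch below assumes each stream is an external command given as an argument list; run_streams and its runner parameter are hypothetical names, and the demonstration uses a stand-in runner so that no real upload is started.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_streams(commands, runner=subprocess.call):
    """Run every command concurrently, one thread per stream, and
    return the results (exit codes by default) in the original order."""
    if not commands:
        return []
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        return list(pool.map(runner, commands))


# Demonstration with a stand-in runner that pretends every stream succeeded.
placeholder_commands = [["upload-tool", "stream-%d" % n] for n in range(1, 6)]
print(run_streams(placeholder_commands, runner=lambda cmd: 0))  # [0, 0, 0, 0, 0]
```

Threads are enough here because each one only waits on an external process; the useful number of streams is still bounded by the available bandwidth, as noted above.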
Results & Analysis
The chart below shows the time taken to transfer 1GB of data to a SQL Azure table with one clustered index.
The columns are grouped by the data upload tool used and the location of the data source. In each grouping we compare the performance of single versus multiple streams of data.
From the results, we observed the fastest transfer time when loading data from Windows Azure to SQL Azure. Using multiple streams of data clearly improved the performance of both tools. Moreover, using multiple streams of data helped achieve very similar transfer times from both outside and inside Windows Azure.
BCP allows you to vary the batch size (number of rows committed per transaction) and the packet size (number of bytes per packet sent over the internet). From the analysis it was evident that although these parameters can greatly influence the time to upload data, their optimum values depend on the unique characteristics of your data set and the network involved.
For our data set and a network that was behind a corporate firewall:
Tool | Observation
BCP | Best performance at 5 streams, with a batch size of 10,000 and the default packet size of 4K.
SSIS | Best performance at 7 streams. We had the "Use Bulk Insert when possible" check box selected on the ADO.NET destination SQL Azure component.
Best Practices for loading data to SQL Azure
· When loading data to SQL Azure, it is advisable to split your data into multiple concurrent streams to achieve the best performance.
· Vary the BCP batch size option to determine the best setting for your network and dataset.
· Add non clustered indexes after loading data to SQL Azure.
o Two additional indexes created before loading the data increased the final database size by ~50% and increased the time to load the same data by ~170%.
· If, while building large indexes, you see a throttling-related error message, retry using the online option.
Appendix
Destination Table Schema
CREATE TABLE LINEITEM
(L_ORDERKEY bigint not null,
L_PARTKEY int not null,
L_SUPPKEY int not null,
L_LINENUMBER int not null,
L_QUANTITY float not null,
L_EXTENDEDPRICE float not null,
L_DISCOUNT float not null,
L_TAX float not null,
L_RETURNFLAG char (1) not null,
L_LINESTATUS char (1) not null,
L_SHIPDATE date not null,
L_COMMITDATE date not null,
L_RECEIPTDATE date not null,
L_SHIPINSTRUCT char (25) not null,
L_SHIPMODE char (10) not null,
L_COMMENT varchar (44) not null);
CREATE CLUSTERED INDEX L_SHIPDATE_CLUIDX ON LINEITEM (L_SHIPDATE);
CREATE INDEX L_ORDERKEY_IDX ON LINEITEM (L_ORDERKEY);
CREATE INDEX L_PARTKEY_IDX ON LINEITEM (L_PARTKEY);
Using the TPC DbGen utility to generate test Data
The data was generated using the DbGen utility from the TPC website. We generated 1 GB of data for the Lineitem table using the command dbgen -T L -s 4 -C 3 -S 1.
Using the -s option, we set the scale to 4, which generates a Lineitem table of 3 GB. Using the -C option, we split the table into 3 portions, and using the -S option, we chose only the first 1 GB portion of the Lineitem table.
Unsupported Tools
The Bulk Insert T-SQL statement is not supported on SQL Azure. Bulk Insert expects to find the source file on the database server’s local drive or on a network path accessible to the server. Because the server is in the cloud, we cannot put files on it or configure it to access network shares.
Lubor Kollar and George Varghese
-
Blog Post: Hey, Scripting Guy! How Can I Use Windows PowerShell to Identify Inactive User Accounts in Active Directory Domain Services?
[Windows] (Site Home)
Hey, Scripting Guy! I need to use Windows PowerShell to identify inactive user accounts in Active Directory Domain Services (AD DS). I used to have a VBScript script that I would use, but I would like to be able to use Windows PowerShell 2.0 and the new Active Directory cmdlets that come with Windows Server 2008 R2. Is this something that can easily be accomplished?
-- GJ
Hello GJ,
Microsoft Scripting Guy Ed Wilson here. I believe the weather person made a mistake. In fact, I am nearly positive. The weather forecast for the entire week is exactly the same as yesterday’s weather forecast—hot, humid, and a chance of afternoon thundershowers. Dude, Craig and I could never get away with posting the same Hey, Scripting Guy! Blog post for seven days in a row. Where is the weather guesser’s manager? Where is the person in charge of quality control? Who monitors the incoming email for the department of redundancy department? They never make a mistake like this when the weather is mild and sunny with low humidity—it is not fair!
GJ, luckily we have air conditioning, and because it is too hot and humid to be outside in my woodworking shop, I decided to come in and check the email sent to scripter@microsoft.com. By using the Active Directory cmdlets that come with Windows Server 2008 R2, it is easy to query for information about user accounts. The Get-ADUser Windows PowerShell cmdlet is fairly intuitive and actually quite fun to use.
The first thing you will need to do is to import the ActiveDirectory module into the current Windows PowerShell session. To quickly obtain a listing of all the users in Active Directory, supply a wildcard character to the -Filter parameter of the Get-ADUser cmdlet, as shown in the following image.
If you wish to change the base of the search operations, use the SearchBase parameter. The SearchBase parameter accepts an LDAP style of naming. The following command changes the search base to the hsg_TestOU:
Get-ADUser -Filter * -SearchBase "ou=hsg_TestOU,dc=nwtraders,dc=com"
When using the Get-ADUser cmdlet, only a certain subset of user properties is displayed (10 properties, to be exact). These properties will be displayed when you pipe the results to Format-List and use a wildcard character and the -Force parameter as shown here:
PS C:\> Get-ADUser -Identity bob | format-list -Property * -Force
DistinguishedName : CN=bob,OU=HSG_TestOU,DC=NWTraders,DC=Com
Enabled : True
GivenName : bob
Name : bob
ObjectClass : user
ObjectGUID : 5cae3acf-f194-4e07-a466-789f9ad5c84a
SamAccountName : bob
SID : S-1-5-21-3746122405-834892460-3960030898-3601
Surname :
UserPrincipalName : bob@NWTraders.Com
PropertyNames : {DistinguishedName, Enabled, GivenName, Name...}
PropertyCount : 10
PS C:\>
Anyone who knows very much about Active Directory Domain Services (AD DS) knows there are certainly more than 10 properties associated with a user object. Does this mean we need to use the Get-ADObject cmdlet that was examined yesterday?
If I try to display a property that is not returned by the Get-ADUser cmdlet, such as the whenCreated property, an error is not returned, but neither is the value of the property. This is shown here:
PS C:\> Get-ADUser -Identity bob | Format-List -Property name, whenCreated
name : bob
whencreated :
I used the whenCreated property for the user object because I know that it has a value. But suppose I was looking for users that had never logged on to the system? Suppose I used a query such as the one shown here, and I was going to base a delete operation upon the results. The results could be disastrous.
PS C:\> Get-ADUser -Filter * | Format-Table -Property name, LastLogonDate
name LastLogonDate
---- -------------
Administrator
Guest
krbtgt
testuser2
ed
SystemMailbox{1f05a927-a261-4eb4-8360-8...
SystemMailbox{e0dc1c29-89c3-4034-b678-e...
FederatedEmail.4c1f4d8b-8179-4148-93bf-...
HSG_Test
HSG_TestChild
<results truncated>
To retrieve a property that is not a member of the default 10 properties, you must select it by using the -Properties parameter. Get-ADUser does not automatically return all properties and their associated values for performance reasons on large networks: there is no reason to return a large dataset when a small dataset will perfectly suffice. To display the name and the whenCreated date for the user named “bob,” the following command can be used:
PS C:\> Get-ADUser -Identity bob -Properties whencreated | Format-List -Property name, whencreated
name : bob
whencreated : 6/11/2010 8:19:52 AM
PS C:\>
To retrieve all of the properties associated with a user object, use the wildcard character “*” for the Properties parameter value. You would use a command similar to the one shown here:
Get-ADUser -Identity bob -Properties *
The results of this command are shown in the following image.
To produce a listing of all the users and their last logon date, you can use a command similar to the one shown here. This is a single command that might wrap the line, depending on your screen resolution:
Get-ADUser -Filter * -Properties "LastLogonDate" |
sort-object -property lastlogondate -descending |
Format-Table -property name, lastlogondate -AutoSize
The output produces a nice table that is shown in the following image.
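Once names and last logon dates have been retrieved, deciding which accounts count as inactive is a simple date comparison. The following is a minimal Python sketch of that step, not part of the Active Directory cmdlets: the record shape (a name paired with a last logon time or None) and the 90-day threshold are assumptions made for illustration. It keeps accounts with no recorded logon in a separate list, because an empty value is only meaningful if the property was actually requested in the query.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# Hypothetical record shape: (account name, last logon time or None if never recorded).
User = Tuple[str, Optional[datetime]]


def find_inactive(users: List[User], now: datetime,
                  max_idle_days: int = 90) -> Tuple[List[str], List[str]]:
    """Return (inactive, never_logged_on) account names.

    An account is inactive when its last logon is older than `max_idle_days`.
    Accounts without a logon value are reported separately rather than
    being treated as inactive.
    """
    cutoff = now - timedelta(days=max_idle_days)
    inactive = [name for name, last in users if last is not None and last < cutoff]
    never_logged_on = [name for name, last in users if last is None]
    return inactive, never_logged_on


if __name__ == "__main__":
    sample = [
        ("bob", datetime(2010, 6, 11, 8, 19, 52)),
        ("ed", datetime(2010, 1, 5)),
        ("Guest", None),
    ]
    inactive, never = find_inactive(sample, now=datetime(2010, 6, 30))
    print("Inactive:", inactive)
    print("No logon recorded:", never)
```

Keeping the two lists apart makes it harder to base a delete operation on values that were simply never retrieved.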
GJ, that is all there is to using Active Directory cmdlets to work with user objects. Active Directory Week will continue tomorrow when we will talk about working with computer objects.
We invite you to follow us on Twitter or Facebook. If you have any questions, send email to us at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.
Ed Wilson and Craig Liebendorfer, Scripting Guys
-
On hot backups
[Programming] (Planet MySQL)
A few years ago I was looking at the crash recovery code and realized that InnoDB had removed all the comments related to replay of the transaction log. Judging by the high quality of comments in the remaining codebase, I realized that it was all done to obscure any efforts to build another InnoDB hot backup solution, a competitor to the first Innobase standalone offering. I enjoyed the moment when Percona launched their own implementation of the tool. Since its inception, it has become more and more robust and feature rich. We have used xtrabackup in our environment a lot, just not for backup: the major use case right now is cloning server instances, whether for building new replicas, shadow servers, or replacing masters, and it allows us to do that without interrupting any operation.
Now, what makes the whole situation way more interesting is that Oracle/MySQL announced at the conference that InnoDB Hot Backup will be part of the Enterprise offering, which makes it way more available to the MySQL customer community than when it required quite expensive per-server licenses. Of course, open source xtrabackup is way easier to tweak for our environment (O_DIRECT support, posix_fadvise(), flashcache hints, etc. were all added after release), and it will be interesting to see how the Oracle-provided tool evolves. Right now xtrabackup already supports streaming operation, which makes it much more usable in large-database-on-small-hardware (read: sharded) environments and provides flexibility to the users. Oracle, of course, owns much more in-house expertise in both the current internals and all the future changes that will happen, so we may see leadership in the field coming from their side too.
One of our reasons for not using a physical backup solution is simply that it is not space efficient. There may be multiple ways to approach that, from robust incremental backups to partial backups that wouldn’t include secondary indexes or would cover only a limited set of tables.
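To make the incremental idea concrete, here is a minimal Python sketch of a checksum-based incremental backup: it keeps one digest per fixed-size block from the previous run and copies only blocks whose digest differs or that are new. This is an illustration only, not how xtrabackup or InnoDB Hot Backup work. The function names are hypothetical, and the 16 KB block size is chosen to match InnoDB's default page size. Note that this naive approach still reads every block in order to compute its digest.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 16 * 1024  # assumption: matches InnoDB's default 16 KB page size


def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one SHA-256 hex digest per fixed-size block of `data`."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def incremental_backup(data: bytes, previous: List[str],
                       block_size: int = BLOCK_SIZE) -> Dict[int, bytes]:
    """Return {block_index: block_bytes} for blocks that are new or changed
    compared with the digests recorded by the previous backup run."""
    changed: Dict[int, bytes] = {}
    for index, digest in enumerate(block_checksums(data, block_size)):
        if index >= len(previous) or previous[index] != digest:
            changed[index] = data[index * block_size:(index + 1) * block_size]
    return changed


if __name__ == "__main__":
    old = b"aaaabbbbcccc"
    new = b"aaaaXXXXccccdddd"
    digests = block_checksums(old, block_size=4)
    # Only the modified block and the appended block are selected.
    print(incremental_backup(new, digests, block_size=4))
```

The saving here is in what gets stored, not in what gets read, since the whole dataset is still scanned on every run.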
Some changes may actually require extended MySQL/InnoDB support: on multiple-terabyte instances one may not want to rescan the dataset for each incremental backup run, as the resulting diff would be just a hundred gigabytes or less. This would require support for an always-running backup agent that would aggregate information about block changes and allow for more efficient backup operation.
Discarding secondary indexes is a way more attractive option with the 5.1/Plugin ability to do fast independent index builds that don’t require one-row-at-a-time B-Tree builds for all indexes at once (which, of course, hit severe memory penalties on large tables or in parallel workloads).
Having always-ready backups is important not only for the ability to rebuild a box (and we have replicas for machine failures); the real value is when backups can be used for massive-scale extraction of subsets of table rows across thousands of machines. For that, one cannot just ship full instance data around from backup storage, so recovery tools will have to be way more flexible. Probably the core feature for that kind of operation would be the ability to import tables directly from a hot backup into online instances; unfortunately, restarting a database instance is still costly (though we’re doing quite some work in that direction too).
I’m extremely happy that InnoDB started fixing operational issues like crash recovery performance, but there’s still a wide area of problems not touched properly yet, especially in the disaster recovery space, and I’m eager to see developments in this field, both from Oracle and from community members.
-
Topo USA 8.0 National Edition
[Africa] (Afrigator)
· Up-to-date, feature-rich topographic software works on PCs as a stand-alone product or in conjunction with DeLorme GPS units
· Updated highway, street, back road, and trail detail, including highways and streets for Canada and major roads for Mexico
· Paperless geocaching with DeLorme Earthmate PN-Series GPS receivers--easily import, export, manage, and update your geocache files
· The most current available USGS terrain and land cover data, with realistic 3-D terrain views, with flyovers and 360-degree rotation
· Voice navigation for PCs, UMPCs, touchscreen phones, and PDAs
$39.99
Product Description
America's Most Up-to-Date, Feature-Rich Topographic Software. In addition to DeLorme's up-to-date terrain, road, and points of interest detail, Topo USA provides access to downloadable aerial imagery, NOAA nautical charts, and authentic USGS 1:24,000 quad maps. Integrate your data downloads with Topo USA 8.0 on your PC for unrivaled planning, routing and navigating. Unlike online map and imagery sources, DeLorme data downloads can be integrated with Topo USA 8.0 maps. Use the split screen editing capability to identify and mark structures, landmarks, navigation markers, new roads and trails, and other features with pinpoint GPS accuracy. Topo USA also enables realistic 3-D terrain views with flyovers, as well as the unique ability to create routes automatically over both roads and trails. The rich level of detail includes land use, land cover, public lands (e.g., BLM, national and state parks and forests), and more than 4 million places of interest.
Amazon.com Product Description
Scout your destination as if you were there, with up-to-date terrain, trail, and road detail. Explore or fly over realistic 3-D. Import aerial imagery. Route your travels over roads and trails. Find public recreation lands. See elevation profiles. Customize and print maps at a wide range of scales. Topo USA 8.0 does it all.
The Most Complete & Best Value Mapping Software for Recreation
Topo USA 8.0 is the most comprehensive computer mapping program for outdoor recreation, with unsurpassed maps, available imagery, trip planning features, on-road navigation, and GPS capabilities.
What's New in Topo USA 8.0
Updates
· Over 300,000 new or updated U.S. streets
· Thousands of new trails
· Detailed streets and roads for Canada
· Main roads for Mexico
· More than 200,000 ADDITIONAL POIs (places-of-interest)
In-Vehicle Navigation & Travel Features
· GPS Radar--find points of interest near your current location
· 2-D or 3-D NavMode for hands-free full-screen view while navigating in your vehicle
· UMPC mode--optimizes your screen for ultra-mobile PCs and small screens
· Spoken directions and voice commands
· Plan Trip option--estimate end of day breaks, fuel stops, and fuel cost
Advanced Geocaching Features
If you own or purchase a PN-Series Handheld GPS, you can take advantage of several new features:
· Import Geocaching.com pocket queries using Cache Register (coming soon)
· Send-to-GPS now available for DeLorme PN-Series handhelds
· See cache descriptions in full, with other cachers' log notes
· Geocaching symbols added to PN-Series symbol set support new geocaching features on PN-Series devices
· Drag cursor to multi-select caches from Topo USA and send them directly to your GPS
· Cache names on maps are hyperlinked to their Geocaching.com pages
And it's by far the best value for the price--EVERYTHING needed comes in-the-box for the purchase price, including:
· Complete USA topographic maps and detailed streets
· Detailed streets for Canada; major roads for Mexico
· Over 4 million places-of-interest (POIs)
· Extensive USA trails network
· Flexible printing choices
· Ability to share your custom maps over the internet, and more
Core Functions
Map Controls
The maps can be controlled using a variety of methods, including the traditional push-button zooming, which drills in/out while keeping the map exactly centered.
Drag and Zoom Cursor
Also, holding down the left mouse button enables you to drag and zoom, left-to-right, across the map, and moving the mouse in the same manner but in the opposite direction zooms out.
3-D Map Views and Flyovers
See the terrain in vivid detail using the 3-D map views and controls. These realistic views also retain the various elements you add to your customized maps--trails, MapNotes, GPS waypoints, and Draw objects. The split-screen framework lets you see 2-D and 3-D maps side by side, with linked Draw tools that move and update in tandem. Now you can also grab the 3-D map views using the image grabber tool and scroll rapidly through 3-D views with great precision. The 3-D software engine has been completely rebuilt for this new version.
Find
DeLorme employs several powerful search capabilities within the software. The first is a simple box called QuickSearch where you type common requests including towns, cities, lakes, mountains, lat/lons, ZIP codes, street addresses and many more to receive the quickest possible matches. The Advanced Search helps clarify more complex searches to provide the best possible results. Enhancements to Find with version 8.0 include the ability to perform searches based on multiple criteria: address, intersection, place name, natural feature, coordinates, places of interest, and more.
Profile
The Profile tool lets you click on a route, line object, or body of water and see the elevation gain between one point and the next. Ideal for bikers, hikers, and large rigs wanting to know the elevation on their trips. It also gives you the ability to import heart rate, speed, and cadence data from GPS wrist computers, and graph them with the terrain profile for evaluation of athlete performance.
Draw
Add MapNotes; draw your own circles, polygons, and squares; measure the area of your draw objects--even draw in your own roads and route on them--these are amazingly powerful draw tools for the money.
Measure
The Measure Tool lets you measure linear distance and area on the map based on the units chosen in the Display tab of the Options dialog box. Draw polygons on the map and see the square footage or acreage of plots of land.
Info
Right-click on the map to learn more about what's underneath your cursor. Lat/Lon, names of streets and bodies of water--even local radio station information--can be viewed in this manner. Moreover, the bottom toolbar of the software displays a continuously updated readout of what's underneath your cursor.
Attaching Images, Web URLs, and Other Documents
Add your own images, live Web URLs, and diagrams and documents to the maps. Embed photos and diagrams showing fleet personnel what they will find at each location. Embed URLs next to your important stops on the maps for easy access when additional information is needed or updated online.
Image Tagger
Preserve a record of exactly where you took your digital photos--all you need is a GPS track. The enlargeable thumbnail images appear on your map exactly where you took them on your trip.
SAMPLE MAPS
· 3-D, Color Imagery for a Realistic View of Your Hike
· Adding Unique USGS Quad Detail to Topo USA 8.0
· Converting Collected GPS Tracks Into Routable Trails
· Creating a Trail Route That Best Suits Your Preferences
· Displaying Collected GPS Waypoints in Topo USA
· Enjoy Paperless Geocaching
· GeoTag Your Digital Photos by their Precise Location
· GPS Navigation with NavMode
· Large Area Perspective with Satellite Imagery
· Navigating on the Water with NOAA Nautical Charts
· Overlay Streets, Places of Interest on Aerial Imagery
· Planning a Fishing Trip With Atlas & Gazetteer Locations
Ease of Use
E-Z Nav Toolbar
This handy toolbar runs along the top of the interface and offers access to commonly needed options, including GPS settings, the Measure tool, MapShare and Routing functions.
Keyboard Shortcuts
New is the ability to set your own keyboard shortcut preferences for optimum control when using GPS. The new key-bindings function is very powerful and advanced so you can design your ideal in-vehicle navigation solution to your own personal specifications.
NetLink
NetLink is your online link to DeLorme for important technical support messages, as well as the place where you select the download areas you want for use within the software.
Routing
Routing on Roads or Trails
Create automatic road routes to get you to the trailhead, then use automatic trail routing to bring on your hikes. You can create routes from the Route tab or by simply right-clicking on the map and setting your Starts, Stops, and Finish points.
Personalized Routing with Address Book
New is the ability to create and save commonly-used names for places you visit often, such as "home," "work," or "Dad's." Once these names are assigned, the routing Starts, Stops, and Finishes also display these names, making retrieval easy for repeated usage. Customize routes using your own names and save them within MapShare for others to see--they'll be amazed. You can also include up to 200 names from your address book to be recognized using this system.
Add Local Roads to Routing
A DeLorme exclusive, this is the best way to update local roads when new developments are planned or added between DeLorme software releases. This tool is located within the Draw tab and lets you draw in the road segment, connecting it to another local road within the database. Assign a name, save it, and when you create automatic routes, the software is smart enough to include these new local roads in the routing calculations and also update the directions.
Map & Travel Information
NEW! Points-of-Interest
New to Topo USA with the release of version 8.0 are 4 million places of interest from DeLorme's Street Atlas USA, including restaurants, lodgings, retailers, and businesses of all kinds. Also new are backcountry locations from the DeLorme Atlas & Gazetteers with icons for boat ramps, unique natural features, fishing and hunting locations, and more.
Map Data
Topo USA offers the best of both worlds--the latest USGS digital topographic data and the latest DeLorme street network. Unlike the scanned raster USGS paper maps used in many other topographic software products, this unique DeLorme vector data blending assures you are working with the latest information. Many scanned USGS quad maps were made decades ago, so the roads shown are oftentimes outdated or nonexistent. Topo USA's vector -- or intelligent -- blended data enables smart searching, better labeling, and many other powerful software capabilities.
Data Download Dollars--FREE Map, Chart, and Imagery Downloads
In addition to the enhanced topographic and street maps on the included Topo USA 8.0 DVD-ROM, your purchase entitles you to $40 in new Data Download Dollars, good for supplementary data as Web downloads. Options include USGS 1:24,000 quad maps, NOAA nautical charts, and the following imagery types: color aerial, black-and-white aerial, high-resolution cities (color aerial), and 10-meter color satellite. Data Download Dollars and the ability to purchase each dataset separately replace the former Aerial Data Packets, which combined three data types together. Now you get only what you really need. Visit the NetLink tab of your DeLorme software to download supplementary datasets.
GPS Features
PN-Series Handhelds
The Perfect Topo USA Outdoor Partner
· Insert positionally-accurate waypoints and tracks into the maps you upload to your PN-Series receiver
· Automatically create routes on roads or trails for upload to your receiver
· Convert collected tracks to routable roads or trails*
· Easily manage your collected GPS data
GPS Waypoint Exchange
Use the automatic route generation tool to create the route you want without hand-drawing each object. Then exchange this track log to your handheld GPS receiver and bring the information with you into the field. Bring your field data back into the desktop software from your GPS receiver to see where you have been. Supports Garmin, Magellan, and most other NMEA-compliant receivers.
GPS Log Playback
Create a route along your favorite trail, right-click on the trail and Save as GPS Log File. Open the GPS tab, switch to 3-D mode, and play back the log file showing various icons moving along the trail. It's as close to being there as one can get without actually leaving your home.
Geocaching Support
Topo USA supports popular geocaching site file formats .gpx and .loc, which makes geocaching more fun than ever. New with version 8.0 is the ability to import extended descriptions and hints from .gpx files found online at geocaching.com, also transportable to the PN-Series GPS Devices.
Printing & Sharing
Print
Print crisp color or black & white maps that you control to best match what is seen on screen. Print maps including the elevation profile or simply print the elevation profile by itself. New in this version, print the split-screen maps showing your various datasets and Draw objects.
MapShare
MapShare makes it easy to share your customized maps and directions with family, friends, and business associates. You create exactly the content you want and then post within our online MapShare library, which includes private administrative tools for you to manage. Similar to some of the better online photo resources, MapShare allows you to provide controlled access to your important maps without worrying about email and spam filters.
-
MS SQL Database Administrator (financial district)
[Jobs, Jobs (not Steve)] (craigslist | all jobs in SF bay area)
Daegis (www.daegis.com) delivers electronic discovery (ED) solutions to law firms and corporations from our offices in San Francisco, New York, Boston, and Chicago. We are a growing ED firm in a growing ED marketplace.
We currently have a position open in our Software Development department for candidates who like working in a fast-paced, team centered, smaller-company environment. In addition to the qualifications listed below, the ideal candidates will be able to manage multiple projects in various stages of development simultaneously, be a strong team player and be interested in learning more about the electronic discovery and litigation support industry.
Strong communication skills are a must.
Duties & Responsibilities:
Support, administer and maintain the production and development MS SQL Server database environments running on the Windows platform.
Work with the software development team in analyzing current database schemas and advise/implement structural improvements for increased efficiency and performance.
Assess the efficiency and performance of existing database structure and schema and advise on optimization. Work with IT and the development team to implement and maintain optimizations.
Review and advise on the impact of new applications or revisions to existing applications interacting with SQL.
Develop and implement best practices to ensure a consistent approach is applied to database designs.
Database design, creation and upgrade.
Creation and management of tables, constraints, indexes, grants and sequences.
Troubleshoot and resolve performance issues.
Database/schema backup and restore.
System monitoring utilizing scripts and enterprise tools.
Monitor and manage system generated alerts.
Assist with custom user requests.
Assist in creation of specifications and testing documentation.
Demonstrate extreme attention to detail and organization in all aspects of work.
May be responsible for after-hours technical support as needed.
Must Haves:
Proven professional experience in:
- Database/schema design;
- Tuning databases and queries;
- Altering databases/schemas, including but not limited to creating/altering tables, constraints, indexes, grants, and sequences;
- T-SQL Stored Procedures, Functions, and Packages;
- Using SQL Profiler and Query Analyzer to identify performance issues;
- Working with large datasets (100M+ rows);
- Database sharing
Strong knowledge of:
- Replication/Clustering/Tuning/Sizing/Monitoring;
- Backup/Recovery/Upgrade procedures;
- Database Versioning;
- Source Control;
- Team Foundation Server
Must work well as a member of a team.
Qualifications
- Bachelor's degree
- 3 - 5 years SQL Server DBA experience in a production environment working with MS SQL 2005, 2008 and T-SQL
- Excellent communication and interpersonal skills
- Excellent technical troubleshooting skills
If you are interested in applying for this position AND have the above noted experience, please follow the instructions below. It is very important that you read and follow these instructions thoroughly.
1. Your resume submission must contain job code# 70SF03232010
2. Please include a cover letter with your salary expectations.
3. Should your resume be considered for an interview, please include a phone number and a preferred time of day to be contacted to set up an interview.
NO RECRUITING AGENCIES PLEASE.
-
Is there a MySQL New feature request list anywhere?
[Programming] (Planet MySQL)
Since the time that I’ve been using MySQL I have filed quite a few bug reports. Some of these have been fixed, and many of the bug reports are actually new feature requests. While working with MySQL Enterprise Monitor I’ve probably filed more feature requests than bug reports. That’s fine of course, and my opinion of what is needed in MySQL or Merlin is one thing; yours or the MySQL developers’ is something else. We all have our own needs and find things missing which would solve our specific problems. If I have ten feature requests open and only one could be added to the software, I’d also like to be able to say: this feature is the most important one for me.
However, it seems to me that there is no easy way in the MySQL bug tracker at the moment to group new feature requests into sets of related features and then see the different types of requested features. I imagine many feature requests may be quite similar, but as I do not have a lot of time to look at all bugs it is easy to lose track of the things that people are asking for. It’s also likely that others who might be interested in my feature request are not aware of the request or able to say “I’d like this too”.
Having a clearer list of requested new features, especially if you have a clearer idea of how many people are interested in these new features (whether paying customers or not), would surely be a good way of guiding the product’s development in a way which would be useful to a wider audience. Is there any way this can be done with MySQL, and how is this done with other products which also are complex and have “insufficient resources” to be able to satisfy everyone’s wishes? Currently I do not feel that I can see where MySQL is going or work out if features that I need might actually be implemented in a reasonable time span (or at all), and that is rather frustrating.
Some of the “Enterprise” type features that I think are important are just the larger, more complex examples. One is better partition management: variables such as innodb_file_per_table really suck, but the alternative of X ibdata files which you can’t manage properly is even worse. Another is better replication: taking the replication process out and putting it into a separate daemon would allow you to do N:1 replication, which is impossible in the current MySQL implementation but actually very useful if you want to have multiple sets of replicated databases, each handling its own dataset, with one or more central servers which see the whole combined dataset. Many simpler changes are also important, and some, I get told, will happen after MySQL 7. For me that’s never-never land.
So is there a way that this can all be done more transparently?
-
A new look at an old SA practice: separating /var from /
[Corporate Blogs, Enterprise, RIA (Rich Internet Apps)] (Sun Bloggers)
An old school SA practice...
This is probably the geekiest blog title I've used - but today's blog is a short look at two variations on the old sysadmin practice of separating /var from /, inspired by recent "how do I do this?" calls.
Why do it? How was it done before?
This was traditionally done to ensure that growing space consumption in /var, perhaps caused by core, log or package files, didn't exhaust critical parts of your file system. This could happen because some program kept dumping core or generating log entries. Exhausting space would add further injury by causing other failures.
Several techniques can prevent such problems. One method is to use coreadm to put core files somewhere else, and to use logadm and /etc/logadm.conf to rotate log files on a schedule that is consistent with your disk space and retention policy. But, the biggest hammer and most complete solution was to keep /var in a separate file system by giving it a dedicated UFS file system on its own disk slice. That way, even if something went amok and filled /var, it had no effect on other file systems.
The disadvantage, of course, is the hassle of creating and sizing separate disk slices. You had to plan how many slices you needed and how big they were, and if you got them wrong it was really inconvenient to change them. You might have one slice and file system too big, wasting space you really needed in another slice, but reallocating space was a drag. Having storage allocated into little islands was a real time-waster, especially on the itty-bitty disk drive capacities we used to live with.
ZFS, and in this case ZFS boot, pretty much eliminated this inconvenience - as I'll discuss in a moment.
But now, an old joke...
Before I go into the two examples that came up, a classic joke from mathematics or science class.
The professor is in the front of the classroom and writes an equation on the blackboard (I'm picturing the professor I had when studying Fourier transforms in EE class, but I won't try to do his accent.) Pointing to it, he tells the class, "As you can see, this theorem is clearly trivial."
Turning back to the blackboard he pauses for a moment, puts his hand on his chin and says "Hmmm.... just a moment." Now he starts working on the equation's derivation, covering blackboard after blackboard with equations - everything from α to ω. He fills all the blackboards in the classroom, mumbles "excuse me, I'll be right back," and then goes into an adjacent empty classroom to use its blackboards.
Twenty minutes pass. Finally, the professor returns to the classroom. He beams at the students with a big smile and says "I was right. It is trivial!"
I think this may be relevant to the rest of the post! :-)
The trivial case with ZFS root file system
I'll start with the straightforward case first. I was contacted by a long time friend (who has exceptional knowledge of Solaris and other operating systems, but is new to ZFS) who had wanted to restrict /var for a fresh installation of Solaris 10 that he had just done. He used ZFS boot and selected the option that allocated a separate ZFS dataset for /var, and wanted to know if there was an easy way to control its size.
I thought that this should be easy with a ZFS quota, but to be sure, I brought up a new instance of Solaris 10 under VirtualBox to run through the steps and get the right ZFS dataset name. I allocated the separate /var (it's an option you specify during install), and after installation completed I logged in and issued the following commands:
    # zpool list
    NAME    SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
    rpool  15.9G  4.16G  11.7G    26%  ONLINE  -
    # zfs list
    NAME                            USED  AVAIL  REFER  MOUNTPOINT
    rpool                          4.62G  11.0G    34K  /rpool
    rpool/ROOT                     3.12G  11.0G    21K  legacy
    rpool/ROOT/s10x_u8wos_08a      3.12G  11.0G  3.06G  /
    rpool/ROOT/s10x_u8wos_08a/var  65.6M  11.0G  65.6M  /var
    rpool/dump                     1.00G  11.0G  1.00G  -
    rpool/export                    265K  11.0G    23K  /export
    rpool/export/home               242K  11.0G   242K  /export/home
    rpool/swap                      512M  11.5G  42.0M  -
Right - all I should need to do is set a quota on rpool/ROOT/s10x_u8wos_08a/var, so let's do that. I picked a quota slightly larger than the amount of space already consumed so I could easily test filling it up by creating dummy files with random data. I did that once to make sure I didn't mess up the syntax, and once more in earnest to exceed the quota:
    # zfs set quota=80m rpool/ROOT/s10x_u8wos_08a/var
    # zfs get quota rpool/ROOT/s10x_u8wos_08a/var
    NAME                           PROPERTY  VALUE  SOURCE
    rpool/ROOT/s10x_u8wos_08a/var  quota     80M    local
    # dd if=/dev/urandom of=/var/XX1 bs=1024 count=10000
    10000+0 records in
    10000+0 records out
    # zfs list rpool/ROOT/s10x_u8wos_08a/var
    NAME                            USED  AVAIL  REFER  MOUNTPOINT
    rpool/ROOT/s10x_u8wos_08a/var  75.3M  4.67M  75.3M  /var
    # dd if=/dev/urandom of=/var/XX2 bs=1024 count=10000
    write: Disc quota exceeded
    4737+0 records in
    4737+0 records out
    #
    # ls -l XX*
    -rw-r--r--   1 root     root     10240000 Mar 17 14:15 XX1
    -rw-r--r--   1 root     root      4849664 Mar 17 14:16 XX2
    # zfs list rpool/ROOT/s10x_u8wos_08a/var
    NAME                            USED  AVAIL  REFER  MOUNTPOINT
    rpool/ROOT/s10x_u8wos_08a/var  80.1M      0  80.1M  /var
Mission accomplished: the second file reached the quota allocated to this ZFS dataset as required. The only odd thing (in my opinion) is the odd spelling "Disc" instead of "Disk" in the message "write: Disc quota exceeded". So, if I'm building a Solaris system and want to keep /var from exhausting disk space, all I need is one command to set the quota. Sweet.
A less trivial case, with zones
Shortly after the preceding example, I was contacted by a customer who wanted to do something similar to control /var within Solaris Containers. He tried to create the zone with /var defined as a delegated ZFS file system using legacy mounts. There seems to be a chicken-and-egg situation about what parts of the zone's filesystem must already be mounted before the zone can boot, but then you can't delegate it to the zone. Instead, I created a ZFS dataset and assigned it to the zone's /var:
    # zfs create rpool/zones/vartest
    # zfs list rpool/zones/vartest
    # cat varzone.cfg
    create
    set zonepath=/zones/varzone
    set autoboot=false
    add net
    set physical=e1000g0
    set address=192.168.56.164
    end
    add fs
    set dir=/var
    set special=/zones/vartest
    set type=lofs
    end
    add inherit-pkg-dir
    set dir=/opt
    end
    verify
    commit
    # zonecfg -z varzone -f varzone.cfg
    # zoneadm -z varzone install
    A ZFS file system has been created for this zone.
    Preparing to install zone <varzone>.
    Creating list of files to copy from the global zone.
    Copying <2899> files to the zone.
    Initializing zone product registry.
    Determining zone package initialization order.
    Preparing to initialize <1062> packages on the zone.
    Initialized <1062> packages on zone.
    Zone is initialized.
    Installation of <2> packages was skipped.
    The file </zones/varzone/root/var/sadm/system/logs/install_log> contains a log of the zone installation.
So far so good. After booting the zone without incident, I set a quota and fill it up (note: this is a much bigger /var because I'm building a zone in a Solaris instance with a bunch of additional software in /var/sadm/pkg):
    # zfs list rpool/zones/vartest
    NAME                  USED  AVAIL  REFER  MOUNTPOINT
    rpool/zones/vartest   274M  9.31G   274M  /zones/vartest
    # zfs set quota=300m rpool/zones/vartest
Within the zone, I exhaust allocated space using the same method as before:
    # dd if=/dev/urandom of=/var/xx1 bs=1024 count=100000
    write: Disc quota exceeded
    26369+0 records in
    26369+0 records out
So, I was able to create a separate /var for the zone, and manage its space independently from the zone's root. WARNING: I do not know if this is a supported or recommended procedure, even though it seems to work. My recommendation is that it's more important to impose a quota on the zone's ZFS-based zone root, in order to control its total accumulation of disk space. That protects other zones and other applications that may be using the same ZFS pool.
Conclusions
Separating /var was especially important with the small boot disk capacities we had to work with in Ye Olde Days, and perhaps became less important with the large disks we have now. However, this becomes important again due to the availability of relatively low capacity Solid State Disk (SSD) boot drives being used for fast local boot disks with low power consumption, and because of virtual environments in which a single Oracle Solaris instance might host many containers, each with its own /var and pattern of space consumption.
So, maybe this is a useful Old School idea that has new, and slightly different relevance today.
-
Defending statistical methods
[Physics, Science] (The Reference Frame)
There surely exist propositions by the skeptics - and opinions liked by many skeptics - that I find dumb. I don't know whether they're as frequent as the alarmists' delusions but they certainly do exist. A whole article of this sort was written by Tom Siegfried in Science News, Odds Are, It's Wrong.
The article, whose subtitle is "Science fails to face the shortcomings of statistics" - it sounds serious, doesn't it? - was promoted at Anthony Watts' blog. The most characteristic quote in the article is the following:
It’s science’s dirtiest secret: The “scientific method” of testing hypotheses by statistical analysis stands on a flimsy foundation.
So I gather that the idea is that one should throw away all statistical methods which are a "mutant form of math". Holy cow.
There surely exist whole scientific disciplines that are trying to find tiny, homeopathic signals that can be hugely overinterpreted and hyped because the researchers are usually rewarded for such statements, regardless of their validity (or at least, they don't pay any significant price if the claims turn out to be wrong). The science of health impacts of XY is the classic example - and environmental and climate sciences may have become another.
People in those disciplines are usually led by their environment to be "finding" effects even if they don't really exist. The average ethical and intellectual qualities of the people who work in these disciplines are poor. But it's just preposterous to imagine that the right cure could be to throw away or ban all statistical methods.
Statistical methods are crucial and omnipresent
In fact, statistical methods have always been essential in any empirically based science. A theory predicts a quantity to be "P" and it is observed to be "O". The idea is that if the theory is right, "O" equals "P". In the real world, neither "O" nor "P" is known infinitely accurately.
Why? Because observations are never accurate, so "O" always has some error, at least if it is a continuous quantity. And "P" is almost always calculated by a formula that depends on other values that had to be previously measured, too. So even the predictions "P" have errors. There are various kinds of errors that contribute and they would deserve a separate lecture.
Moreover, quantum mechanics implies that all observations ever made have some uncertainty and all of them are statistical in character. The most complete possible theories can only predict the probabilities of individual outcomes. Clearly, all observations you can ever make have a statistical nature. In particular, experimental particle physics would be impossible without statistics. If you can't deal with the statistical nature of the empirical evidence, you simply can't do empirical science.
Now, if "O" and "P" are known with some errors, how do you determine whether the theory passes or fails the test? The errors are never "strict": there is always a nonzero probability that a very big error, much bigger than the expected one, is accumulated, so you should never imagine that the intervals "(O - error, O + error)" are "strictly certain". Nothing is certain. If you pick 1,000 random people, the deviation of the number of women from 500 may be around 30; it is extremely unlikely, though not impossible, that there will be 950 women and 50 men.
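To put rough numbers on the 1,000-people example, here is a minimal Python sketch. It assumes each person is independently a woman with probability 1/2 (a simplifying assumption, not a demographic claim); the function names are mine. Under that assumption the standard deviation of the count is about 16, so a deviation of 30 is roughly a two-sigma fluctuation, while 950 women is absurdly improbable but still not strictly impossible.

```python
import math

def binomial_sigma(n: int, p: float = 0.5) -> float:
    """Standard deviation of the number of successes in n independent trials."""
    return math.sqrt(n * p * (1.0 - p))

def tail_probability_at_least(n: int, k: int) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, 1/2), using integer arithmetic."""
    favourable = sum(math.comb(n, i) for i in range(k, n + 1))
    return favourable / 2 ** n

sigma = binomial_sigma(1000)
print(f"sigma for 1,000 people: {sigma:.2f}")
print(f"a deviation of 30 is {30 / sigma:.2f} sigma")
print(f"P(at least 950 women) = {tail_probability_at_least(1000, 950):.3g}")
```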
The answer to the question that started the previous paragraph is, of course, that if "O" and "P" are (much) further from one another than both errors of "O" as well as "P", the theory is falsified. It's proven wrong. If they're close enough to one another, the theory may pass the test: we failed to disprove it. But as always in science, it doesn't mean that the theory has been proven valid. Theories are never proven valid "permanently". They're only temporarily valid until a better, more accurate, newer, or more complete test finds a discrepancy and falsifies them.
In the distant past, people wanted to learn "approximate", qualitatively correct theories. So the hypotheses that would eventually be ruled out used to be "very wrong". Their predicted "P" was so far from the observations "O" that you could have called the disagreement "qualitative" in character. However, strictly speaking, it was never qualitative. It was just quantitative - and large.
But as our theories of anything in the physical Universe are getting more accurate, it is completely natural that the differences between "O" and "P" of the viable candidate hypotheses are getting smaller, in units of the errors of "O" or "P". In some sense, the new scientific findings at the "cutting edge" or the "frontier" almost always emerge from the "mud" in which "O" and "P" looked compatible. When the accuracy of "O" or "P" increases, we can suddenly see that there's a discrepancy.
We should always ask how big a discrepancy between "O" and "P" is needed for us to claim that we have falsified a theory. This is a delicate problem because there will always be a nonzero probability that the discrepancy has occurred by chance. We don't want to make mistakes. So we want to be e.g. 99.9% sure that if we say that a theory has been falsified, it's really wrong.
The required separation between "O" and "P" can be calculated from the figure above, from 99.9%. If you don't know the magic of statistical distributions, especially the normal one, I won't be teaching you about them in this particular text. But it's true that the probability that the falsification "shouldn't have been done" because the disagreement was just due to chance is decreasing more quickly than exponentially with the separation, like the tail of a Gaussian.
So e.g. particle physics typically expects "new theories" to be supported by "5 sigma signals" - in some sense, the distance between "O" and "P" is at least 5 times their error. The probability that this takes place by chance is smaller than one in one million. Particle physicists choose such a big separation - and huge confidence level - because they don't want to flood their discipline with lots of poorly justified speculations. They want to rely upon solid foundations so statistical tests have to be really convincing.
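For readers who want to check the "one in one million" figure, here is a small Python sketch that converts a separation in sigmas into a chance probability. It assumes a Gaussian error distribution and counts deviations in either direction (the two-sided convention; a one-sided count would halve the numbers). The function name is mine. It gives about 4.6% for 2 sigma and less than one in a million for 5 sigma.

```python
import math

def two_sided_p(n_sigma: float) -> float:
    """Probability that a Gaussian variable lands more than n_sigma
    standard deviations away from its mean, in either direction."""
    return math.erfc(n_sigma / math.sqrt(2.0))

for k in (1, 2, 3, 5):
    p = two_sided_p(k)
    print(f"{k} sigma: chance probability = {p:.3g}")
```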
Softer disciplines typically choose less than 5 sigma to be enough: 2 or even 1 sigma is sometimes presented as a signal that matters. Of course, this is because they actually want to produce lots of results even though they may be (and, sometimes, are likely to be) rubbish. But a simple fix is that they should raise the required confidence level for their assertions - e.g. from 2 sigma to 5 sigma. They don't have to immediately throw statistics as a tool away.
A problem is that many of these researchers actually don't want to do it. They don't want their science to work right. They have other interests.
In fact, while the confidence level is dramatically increased if we go from 2 sigma to 5 sigma (something like from 95% to 99.9999%), the required amount of data we need to collect to get the 5-sigma accuracy is just "several times" bigger than for the 2-sigma accuracy. So if there's some effect, it's not such a huge sin to demand that published "discoveries" should be supported by 5-sigma signals. Once again, the soft scientists - who propose various theories of health (what is healthy for you) - are choosing low confidence levels deliberately because they like to present new results even though they're mostly bogus. They still get famous along the way.
If some key statements about AGW are only claimed to be established at the 90% confidence level, it's just extremely poor evidence (and may be overstated or depend on the methods, anyway). In principle, it shouldn't be hard for the evidence for such a hypothesis, assuming it's true, to be strengthened to 99.9% or more. That's what "hard sciences" deserving the name require, anyway.
The laymen usually misunderstand how little "90%" is as a confidence level - and some merchants of fear masterfully abuse this ignorance. 90% vs 10% is not that "qualitatively" far from 50% vs 50% - and one can transform one to the other by a "slight" pressure in the methodology and the formulae. If you want to be scientifically confident about a conclusion, you should really demand 99.9% or more. And it's actually not that hard to obtain such stronger evidence assuming that your hypothesis is actually correct and the "signal" exists.
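The "several times" can be made concrete with a back-of-the-envelope Python sketch. It assumes a fixed true effect and independent measurements, so that the statistical error shrinks as 1/sqrt(N) and the significance grows as sqrt(N); systematic errors are ignored. The function name is mine.

```python
def data_ratio(target_sigma: float, current_sigma: float) -> float:
    """Factor by which the sample size must grow to raise the significance
    from current_sigma to target_sigma, assuming significance ~ sqrt(N)."""
    return (target_sigma / current_sigma) ** 2

print(data_ratio(5, 2))  # 6.25
```

So under these assumptions, going from a 2-sigma hint to a 5-sigma signal takes about six times more data, not a million times more.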
Falsifying a null hypothesis
I must explain some basic points of the statistical methods. Typically, we want to find out whether a new effect exists. So we have two competing hypotheses: I will call them the null hypothesis and the alternative hypothesis.
The null hypothesis says that no new effect exists - everything is explained by the old theories that have been temporarily established and any pattern is due to chance. When I say "chance", it's important to realize that one must specify the exact character of the "random generator" that produces these random data, including the deviations, correlations, autocorrelations, persistence, color of the noise etc. There's not just one "chance": there are infinitely many "chances" given by "statistical distributions" and we must be damn accurate about what the null hypothesis actually says. (Often, we mean the "white noise" and "independent random numbers" etc.)
The alternative hypothesis says that a new effect is needed: the old explanations and the null hypothesis are not enough.
How do you decide between these two? Well, you calculate the probability that the apparently observed "pattern" could have occurred by chance assuming the null hypothesis. If the probability of something like that were sufficiently high, e.g. 1% or 5%, you say that your data don't contain evidence for the alternative hypothesis.
If the calculated probability that the "pattern" in the data could have been explained by chance - and by the null hypothesis - is really tiny, e.g. 10^{-6}, then your data give you strong evidence that the null hypothesis is wrong. If you say that it's wrong, your risk of having made a wrong conclusion - the so-called "false positive" or "type I error" - is only 10^{-6}. So it's sensible to take this risk. In my example, we falsified the null hypothesis at the 99.9999% level. It's very likely that a new effect has to exist.
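As a rough sketch of the calculation just described, the following Python snippet compares an observed value "O" with a predicted value "P". It assumes that both errors are Gaussian and independent, so that they add in quadrature; that is the simplest possible choice of "chance", and a real null hypothesis may need a more carefully specified noise model. The function name and the numbers in the example are hypothetical.

```python
import math

def significance(observed: float, predicted: float,
                 err_observed: float, err_predicted: float):
    """Return (separation in sigmas, two-sided p-value) for the hypothesis
    that O and P differ only by chance, assuming independent Gaussian errors."""
    combined_error = math.hypot(err_observed, err_predicted)
    n_sigma = abs(observed - predicted) / combined_error
    p_value = math.erfc(n_sigma / math.sqrt(2.0))
    return n_sigma, p_value

# Hypothetical numbers: O = 10.0 +- 0.3, P = 7.0 +- 0.4.
n_sigma, p = significance(10.0, 7.0, 0.3, 0.4)
print(f"separation: {n_sigma:.1f} sigma, p-value: {p:.2g}")
```

With these made-up inputs the separation is about 6 sigma, so the null hypothesis would be rejected even by the particle physicists' standard.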
You're expected to have an alternative hypothesis that actually describes the data more accurately and gives a higher probability that the data could have occurred according to the alternative hypothesis, with its new understanding of "chance".
However, if the probability of getting the pattern by chance, from the null hypothesis, is substantial, e.g. 10%, then your data only provide you with a very weak hint that a new effect could exist. If you use the standards of hard sciences, you should say that your data can't settle the question in either way.
Of course, it is always possible that if you make such a conclusion, you have made another kind of error, the "type II error", also known as the false negative. But what Tom Siegfried seems to misunderstand is that this is a common situation that you simply can't avoid in most cases. The data, with their limited volume and limited accuracy (and assuming a small size of the new effect), simply can't settle the question in either way.
So when you say that you don't have enough evidence to confirm the "pattern", i.e. that the data don't contain statistically significant evidence for the alternative hypothesis i.e. the new effect, it is not the ultimate proof that the alternative hypothesis is wrong. It is not the final proof that the new effect can't exist.
It's just evidence that the new effect is small and unimportant enough so that it couldn't have been detected in the particular sample or experiment. You can't make a final decision here. While hypotheses can be kind of "completely killed" in science, they can never be "completely proved". Even though the null hypothesis can be pretty much safely killed, no one can ever guarantee to you that your particular generalization, your alternative hypothesis, is the most correct one. It could have been better than the null hypothesis in passing this particular test but the next one may falsify your alternative hypothesis, too.
There's no straightforward way to construct better hypotheses! Creativity and intuition are needed before your viable attempts are tested against the data.
And quite often, your data simply don't contain enough information to decide. This is not a bug that you should blame on the statistical method. The statistical method is innocent. It is telling you the truth and the truth is that we don't know. The laymen may often be scared by the idea that we don't know something - and they often prefer fake and wrong knowledge over admitting that we don't know - but it's their illness, their inability to live with what the actual science is telling us (or not telling us, in this case), not a bug of the statistical method.
Misinterpretations, errors, lousy scientists
Of course, the picture above assumes that one actually learns how the statistical method works and what it exactly allows us to claim in particular situations. That has nothing to do with the journalists' or laymen's interpretations. The journalists and other laymen usually don't understand statistics well - and sometimes they want to mislead others deliberately.
But again, it would be ludicrous to blame this fact on the statistical method.
Analogously, bad scientists may calculate confidence levels incorrectly. They may choose unrealistic null and/or alternative hypotheses. And they may misinterpret what their test has really demonstrated and what it hasn't. They may hold completely unrealistic beliefs about the odds that a "generic" hypothesis would pass a similar test so they can't place their calculation in any proper context. Quite typically, such people only blindly follow some statistical recipes that they don't quite understand. So it's not shocking that they can end up with mistakes.
This fact is not specific to statistics. People who are lousy scientists often make errors in non-statistical scientific methodologies, too. That's not a reason to abandon science, is it?
The proper statistical method gives us the best tool to study incomplete or inaccurate empirical information - and in the real world, all empirical information is incomplete or inaccurate, at least to some extent. And one can actually prove that the probability of a "false positive" is as small as the p-value. Well, the p-value is not quite the same thing as the probability of a "false positive" but it's pretty close.
But "false negatives" can never be reliably cured. Whenever your experiment is not accurate enough, it will simply say "no pattern seen" even though a better experiment could see it.
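How often an underpowered experiment says "no pattern seen" can be estimated with a short Python sketch. It assumes a one-sided Gaussian model in which the measured signal, in units of its error, is the true effect plus unit-variance noise, and a discovery is claimed only above a fixed threshold. The function name is mine.

```python
from statistics import NormalDist

def false_negative_rate(effect_in_sigma: float, threshold_in_sigma: float) -> float:
    """Probability that a real effect of the given size fails to cross the
    discovery threshold, assuming unit-variance Gaussian noise."""
    return NormalDist().cdf(threshold_in_sigma - effect_in_sigma)

for effect in (2.0, 5.0, 8.0):
    rate = false_negative_rate(effect, 5.0)
    print(f"true effect {effect} sigma, 5-sigma threshold: missed {100 * rate:.1f}% of the time")
```

An effect that is only 2 sigma in a given experiment is almost always missed at the 5-sigma threshold, while the same effect measured with four times smaller errors (8 sigma) is almost never missed. That is the sense in which only a more accurate experiment cures a false negative.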
The solution to fight against the widespread errors is to require the soft disciplines to become harder - to calculate the confidence levels properly and to require higher confidence levels than those that have been enough for a "discovery" in the recent decades. This recommendation follows from common sense. If your field has been flooded by lots of beliefs in correlations and mechanisms that often turned out to be incorrect or non-existent, it's clear that you should make your standards more stringent.
Scientists, journalists, and laymen should do their best to be accurate and to learn what various tests actually imply.
But it will still be true that no science can be done "quite" without any statistical reasoning. And it's still true that the datasets and experiments will continue to be unable to give the "final answer" to many questions we would like to be answered. These are just facts. You may dislike them but that's the only thing you can do against facts.
So I would urge everyone to try to avoid bombshell statements such as "statistics is a dirty core of science that doesn't work and has to be abandoned". Lousy work of some people can't ever justify such far-reaching claims.
After all, much of the lousy work - and lousy presentation in the media - emerges because people want to claim that the relevant research is "less statistical" in character than it actually is. In most cases, weak statistical signals are being promoted to a kind of "near certainty". So the right solution is for everyone to be more appreciative of the statistical method, not less so!
And that's the memo. -
Security Response Lead for Mobile Security Startup (SOMA / south beach)
[Jobs, Jobs (not Steve)] (craigslist | software/QA/DBA/etc jobs in SF bay area)
Come join Lookout! We're a venture-funded startup solving big problems on small devices.
http://www.mylookout.com
We believe that people should be able to use their mobile phones without having to worry about hackers, viruses, or other hazards of modern life. We're building a product that keeps your mobile phone (and millions of others) safe from all sorts of nasty things: a product people want because it actually makes their lives easier.
Why Lookout?
We love creating useful, usable, and technologically game-changing software. If we do things right, your work will be instrumental in helping people all over the world be safe.
Besides building something meaningful, here's why you'll love working here:
- We're a start-up: it's fun, dynamic, and you will make a difference.
- We've got smart people without bureaucracy or politics.
- We work in the heart of SoMA in Twitter's previous office: right near Caltrain, MUNI, and BART
- We're funded by top-tier venture and angel investors who have helped build companies such as Sun Microsystems, PayPal, Juniper Networks, Good Technology, Vontu, and Symantec.
- We offer competitive salaries, benefits, and stock options.
- You'll have the important, little perks such as a fast computer with two big monitors, a laptop, a smartphone, a nice chair, and free food and drinks.
We've recently been featured in Forbes, CNET, DarkReading, the New York Times and others. Here are some links to our coverage:
http://www.darkreading.com/insiderthreat/security/client/showArticle.jhtml?articleID=222002886
http://www.forbes.com/2009/12/22/mobile-security-software-technology-cio-network-lookout.html
What you'll be doing:
We're looking for someone to head up our security response team. Overall, the team will be responsible for hunting down exploits and malware for mobile phones and generally making sure that we are continually protecting our users from all types of attack.
You'll be responsible for building automated analysis systems and developing new techniques for sifting through large datasets in order to identify emerging threats. You'll explore attack vectors that exploits and malware may use in the future (doing research and proof-of-concept development) to make sure we're always ahead of threats in the wild.
The ideal person for this role is extremely creative in solving technical problems and adept at a wide range of technologies, as you'll be operating everywhere from the lowest layers of mobile device kernels to distributed data analysis systems.
Successful applicants are responsible, self-motivated, and confident; can get things done; can intuitively anticipate problems; look beyond immediate issues; and take initiative to improve both our software and our development infrastructure. In short, we look for people who take pride in the craft of software engineering and have proven to be quite good at it.
We believe in agile software development, metrics, short feedback loops, well-designed APIs, test-driven development, automation wherever possible, and all sorts of other things to make sure we can minimize friction and focus on solving the big problems.
Key responsibilities:
- Build and use automated systems for identifying mobile threats
- Be the technical lead on our security response team, making sure that our users are always protected from mobile threats
- Define and own the security response process
- Hire and manage a small team of security response engineers
- Proactively research new attack vectors on mobile
Requirements:
- Strong software engineering skills
- Experience in a technical leadership position (as a team lead or manager)
- Well versed in security concepts (e.g. software exploitation, malware, network attacks)
- Expertise in a broad range of technologies
- Prior security experience
Bonus points:
- Experience working with distributed data analysis frameworks (e.g. map-reduce)
- Experience analyzing/reverse-engineering software at the assembly/bytecode level
- Understanding of static and dynamic analysis techniques
- Ruby and mobile development experience
- Penetration testing experience (we continually try to hack our own product)
- Security response experience
- Vulnerability disclosure experience
- Significant management experience
- Startup experience
So, now what?
Send an email to people /at/ mylookout /dot/ com, send us a resume, and tell us what interests you.
We can't wait to hear from you!
All positions are full time in San Francisco, CA. Please no recruiters, contractors, or outsourcing firms. -
A little bit of the Others in Us [Gene Expression]
[Science] (ScienceBlogs Channel : Life Science)
Dienekes has reposted some of the abstracts from the meeting of the American Association of Physical Anthropologists. This one caught my eye, Genetic analyses reveal a history of serial founder effects, admixture between long separated founding populations in Oceania, and interbreeding with archaic humans:
Genetic anthropologists continue to debate whether human neutral genetic variation primarily reflects a continuum of demes connected by local gene flow or colonization and serial founder effects. A second unresolved issue concerns the genetic contribution of archaic species to the modern human gene pool. Some studies suggest that this contribution was substantial and that it played an important role in human adaptation. These issues remain unresolved because of inadequacies and biases in datasets, problems in statistical methodology, and the failure to recognize that different evolutionary processes may produce similar outcomes. This study redresses these limitations by analyzing gene identity within and between populations in a dataset comprised of 614 STRs assayed in 1,983 people from 99 widespread populations. Our strategy is to fit hierarchical models to these data and examine residual deviations from the models. Each model involves nesting smaller units such as populations into larger units such as continental regions. It is possible to restate many of these models as either expansions or reductions of each other and thereby identify aspects of population structure that have had a major impact on the overall pattern of diversity. The strong fit of a model estimated using the Neighbor Joining algorithm indicates that human genetic diversity primarily reflects a history of successive founder effects associated with our exodus from Africa, not a continuum of demes connected by gene flow. Residual deviations from the model suggest: 1) the genomes of Oceanic peoples are the product of two independent waves of migration to the region and admixture, and 2) genetic exchange occurred between archaic and modern humans after their initial divergence.
Would be nice if they found a gene which was likely differentiated between archaic and modern alleles, but it doesn't look like that. But the number of populations seems rather large.
-
Can the Real-Time Web Be Realized? Notes from SxSWi
[Social Media] (Ignite Social Media Feed)
One of the more interesting panels I was able to attend at this year's South by Southwest Interactive Festival was called "Can the Real-Time Web Be Realized?" As status updates (via Twitter, via Facebook, via Foursquare) become increasingly important, the ability to figure out what's happening increases. As long as we can organize that data.
But beyond that, searching for products to buy, and being able to see if it's in stock near you, up to the minute, would be equally important (particularly around Christmas, when this year's version of "Tickle Me Elmo" is being fought over.)
The panelists all have a stake in the game. They include:
- Scott Raymond, Gowalla
- Brett Slatkin, Google
- Dare Obasanjo, Microsoft
- Marshall Kirkpatrick, ReadWriteWeb (moderator)
- Jack Moffitt, Collecta, a real-time search engine
Data Formats
To get this to work to maximum effect, you need to have uniform data streams. Moffitt pointed out that some formats, like Atom, RSS and even PubSubHubbub (Push), are good starts, but they have limitations. We need to push toward uniform data sources that can deal with scale.
Interoperability
Brett Slatkin pointed out the current issues with cross-platform compatibility. Remember when you had to pay extra to call someone on another cell phone network? Remember when you could only text message or instant message with people on your same network? We're seeing that now, where Buzz doesn't communicate (directly) with Twitter or Identica. Gowalla and Foursquare are different platforms that don't talk. Ideally, you could use the platform that worked best for you and the data would work together.
But, Which Specifications?
Obasanjo pointed out that developing 6-7 different specifications sounds like "a lot of work for a whole lot of people." What we need instead is a series of protocols that are not proprietary, but allow the data to flow. The end user doesn't care what platform they are in; they only care if a Twitter update took 2 hours to get into their system.
How Does Business Match Common Ground?
Raymond of Gowalla noted that "there's a whole lot of work that needs to be done" on the balance between the individual incentive of the company (Gowalla has an interest in attracting users at the expense of Foursquare right now, for example) and the community's interest in open platforms. When a company has only so much effort it can apply to growth, how much does it apply to its network and standards, and how much does it put into building its closed gardens? I'm paraphrasing him, but that was the main point.
What If You Want to Delete Something that's Been Shared?
Moffitt wondered about the issue of deleting something if it's already been shared all around the web. For example, if you upload a Flickr photo which then goes out to the Google, Bing, and Collecta search engines, but you then decide to delete it, what happens? Currently the streams picked up by these other platforms don't include a "delete this if you have a copy of it" message. So once it's out there, it's out there. Slatkin of Google felt that dealing with deletion is technically simple, while the much larger issues of access control need a lot of work. In other words, who gets to see your data, and can you segment that (easily) by each piece of content?
With 400m on Facebook, Are We Already There?
Kirkpatrick asked whether the public, at least 400m of them, have already voted by becoming Facebook users. But others, including Obasanjo, said we're doing "a horrible job" balancing Facebook's goal (pushing out widely shared content for an active network) and the end user's goal of throttling their data down to only those they want to see it. Moffitt noted that he doesn't want to share his Amazon purchases online, but Amazon uses them for recommendations anyway. And he pointed out that Netflix shared a very small, "anonymized" dataset of users, but smart people realized that if they could decode the person who gave a review, they could follow that data to figure out who many of the users were. Once you know that, you know the movies they like to watch, ostensibly private data. That led the FTC to step in and pressure Netflix to cancel the second round of their effort to improve their recommendation engine with crowdsourcing.
Away from Data Silos, Focus on People
Obasanjo made an excellent point that anyone at SxSW can relate to: To figure out what your friends are doing here, you need to check Foursquare, Gowalla and Twitter. The ideal situation would be for a person to be able to look in one place for their friend and get a picture of all they're doing. A people hub. Slatkin, however, wondered how you monetize that data if you can't serve ads up next to it (he's been taught well at Google). Obasanjo noted the conflict, but figured that whoever does the best job of focusing on the person first will do well for their brands. Good point.
-
Dedup Performance Considerations
[Corporate Blogs, Enterprise, RIA (Rich Internet Apps)] (Sun Bloggers)
One of the major milestones for the ZFS Storage appliance with 2010/Q1 is the ability to dedup data on disk. The open question is then: what performance characteristics are we expected to see from dedup? As Jeff says, this is the ultimate gaming ground for benchmarks. But let's have a look at the fundamentals.
ZFS Dedup Basics
Dedup code is simplistically a large hash table (the DDT). It uses a 256-bit (32-byte) checksum along with other metadata to identify data content. On a hash match, we only need to increase a reference count instead of writing out duplicate data. The dedup code is integrated in the I/O pipeline and is done on the fly as part of the ZFS transaction group (see Dynamics of ZFS, The New ZFS Write Throttle). A ZFS zpool typically holds a number of datasets: either block-level LUNs which are based on ZVOLs, or NFS and CIFS file shares based on ZFS filesystems. So while the dedup table is a construct associated with an individual zpool, enabling the deduplication feature is controlled at the dataset level. Enabling the dedup feature on a dataset has no impact on existing data, which stays outside of the dedup table. However, any new data stored in the dataset will then be subject to the dedup code. To actually have existing data become part of the dedup table, one can run a variant of "zfs send | zfs recv" on the datasets.
Dedup works on a ZFS block or record level. For an iSCSI or FC LUN, i.e. objects backed by ZVOL datasets, the default blocksize is 8K. For filesystems (NFS, CIFS or Direct Attach ZFS), objects smaller than 128K (the default recordsize) are stored as a single ZFS block, while objects bigger than the default recordsize are stored as multiple records. Each record is the unit which can end up deduplicated in the DDT. Whole files which are duplicated in many filesystem instances are expected to dedup perfectly. For example, whole DBs copied from a master file are expected to fall in this category. Similarly for LUNs, virtual desktop users which were created from the same virtual desktop master image are also expected to dedup perfectly.
An interesting topic for dedup concerns streams of bytes such as a tar file. For ZFS, a tar file is actually a sequence of ZFS records with no identified file boundaries. Therefore, identical objects (files captured by tar) present in 2 tar-like byte streams might not dedup well unless the objects actually start on the same alignment within the byte stream. A better dedup ratio would be obtained by expanding the byte stream into its constituent file objects within ZFS. If possible, the tools creating the byte stream would be well advised to start new objects on identified boundaries such as 8K.
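The alignment effect can be illustrated with a small sketch. This is not ZFS code: it only models fixed-size records hashed with SHA256, and the 8K record size, the header sizes and the helper names are assumptions chosen for illustration.

```python
import hashlib
import random

BLOCK = 8 * 1024  # assumed record size for this illustration


def block_hashes(stream: bytes, block: int = BLOCK) -> set:
    """Hash every fixed-size record of a byte stream, as a block-level dedup would."""
    return {
        hashlib.sha256(stream[i:i + block]).digest()
        for i in range(0, len(stream), block)
    }


def shared_blocks(a: bytes, b: bytes) -> int:
    """Count records that two streams have in common (i.e. that could dedup)."""
    return len(block_hashes(a) & block_hashes(b))


def pad_to_block(header: bytes, block: int = BLOCK) -> bytes:
    """Pad a header with zeros so that the next object starts on a record boundary."""
    return header + b"\0" * (-len(header) % block)


# The same 64K "file" embedded in two streams behind headers of different sizes.
rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(8 * BLOCK))
header_a = b"A" * 512
header_b = b"B" * 1536

misaligned = shared_blocks(header_a + payload, header_b + payload)
aligned = shared_blocks(pad_to_block(header_a) + payload,
                        pad_to_block(header_b) + payload)
print(misaligned, aligned)
```

With the payload at different offsets, the two streams share no records even though they carry the same file. Once both headers are padded to an 8K boundary, all eight payload records match.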
Another interesting topic is backups of active databases. Since databases often interact with their constituent files with an identified block size, it is rather important for the deduplication effectiveness that the backup target be set up with a block size that matches the source DB block size. Using a larger block on the deduplication target has the undesirable consequence that modifications to small blocks of the source database will cause those large blocks in the backup target to appear unique and, somewhat artificially, not dedup. By using an 8K block size in the dedup target dataset instead of 128K, one could conceivably see up to a 10X better deduplication ratio.
Performance Model and I/O Pipeline Differences
What is the effect of dedup on performance? First, when dedup is enabled, the checksum used by ZFS to validate the disk I/O is changed to the cryptographically strong SHA256. Darren Moffat shows in his blog that SHA256 actually runs at more than 128 MB/sec on a modern CPU. This means that less than 1 ms is consumed to checksum a 128K unit and less than 64 usec for an 8K unit. This cost is only incurred when actually reading or writing data to disk, an operation that is expected to take 5-10 ms; therefore the checksum generation or validation is not a source of concern.
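The checksum cost quoted above follows from simple arithmetic, sketched here. The function name is hypothetical, and the sketch assumes that "MB" means 2^20 bytes and that throughput is exactly 128 MB/sec (the text says "more than", so real costs would be lower).

```python
def checksum_time_us(block_bytes: int, throughput_mb_per_s: float = 128.0) -> float:
    """Microseconds needed to checksum one block at the given throughput.

    Assumes 1 MB = 2**20 bytes.
    """
    bytes_per_second = throughput_mb_per_s * 1024 * 1024
    return block_bytes / bytes_per_second * 1e6


t_128k = checksum_time_us(128 * 1024)  # just under 1 ms
t_8k = checksum_time_us(8 * 1024)      # just under 64 usec
print(f"128K: {t_128k:.1f} usec, 8K: {t_8k:.1f} usec")
```

Both figures are small next to the 5-10 ms expected for the disk I/O itself.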
For the read code path, very little modification should be observed. The fact that a read happens to hit a block which is part of the dedup table is not relevant to the main code path. The biggest effect will be that we use a stronger checksum function invoked after a read I/O: at most an extra 1 ms is added to a 128K disk I/O. However, if a subsequent read is for a duplicate block which happens to be in the pool ARC cache, then instead of having to wait for a full disk I/O, only a much faster copy of the duplicate block will be necessary. Each filesystem can then work independently on its copy of the data in the ARC cache, as is the case without deduplication.
Synchronous writes are also unaffected in their interaction with the ZIL. The blocks written in the ZIL have a very short lifespan and are not subject to deduplication. Therefore the path of synchronous writes is mostly unaffected unless the pool itself ends up not being able to absorb the sustained rate of incoming changes for tens of seconds. Similarly for asynchronous writes, which interact with the ARC caches, the dedup code has no effect unless the pool's transaction group itself becomes the limiting factor.
So the effect of dedup will take place during the pool transaction group updates. This is where we take all modifications that occurred in the last few seconds and atomically commit a large transaction group (TXG). While a TXG is running, applications are not directly affected except possibly for the competition for CPU cycles. They mostly continue to read from disk, do synchronous writes to the ZIL, and do asynchronous writes to memory. The biggest effect will come if the incoming flow of work exceeds the capability of the TXG to commit data to disk. Then eventually the reads and writes will be held up by the necessary write throttling code, which prevents ZFS from consuming all of memory.
Looking into the ZFS TXG, we have 2 operations of interest: the creation of a new data block and the simple removal (free) of a previously used block. Because ZFS operates under a copy-on-write (COW) model, any modification to an existing block actually represents both a new data block creation and a free of a previously used block (unless a snapshot was taken, in which case there is no free). For file shares, this concerns existing file rewrites; for block LUNs (FC and iSCSI), this concerns most writes except the initial one (the very first write to a logical block address or LBA actually allocates the initial data; subsequent writes to the same LBA are handled using COW).
For the creation of a new application data block, ZFS will run the checksum of the block, as it does normally, and then look up the dedup table for a match based on that checksum and a few other bits of information. On a dedup table hit, only a reference count needs to be increased, and such changes to the dedup table will be stored on disk before the TXG completes. Many DDT entries are grouped in a disk block and compression is involved. A big win occurs when many entries in a block are subject to a write match during one TXG: a single 16K I/O can then replace tens of larger I/Os.
As for free operations, the internals of ZFS actually hold the referencing block pointer, which contains the checksum of the block being freed. Therefore there is no need to read the data being freed or recompute its checksum. ZFS, with checksum in hand, looks up the entry in the dedup table and decrements the reference counter. If the counter is still non-zero then nothing more is necessary (just the dedup table sync). If the freed block ends up without any reference then it will be freed.
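The reference-counting behavior described above can be modeled with a toy table. This is only a sketch of the idea, not the ZFS implementation: the class and method names are hypothetical, the table is keyed on the SHA256 checksum alone (ZFS also uses "a few other bits of information"), and nothing here models TXGs, on-disk layout or compression.

```python
import hashlib


class ToyDedupTable:
    """In-memory model of a dedup table: checksum -> reference count."""

    def __init__(self) -> None:
        self.refcounts = {}
        self.blocks_written = 0  # blocks that actually had to be stored

    def write(self, data: bytes) -> bytes:
        """Create a block; on a table hit only the reference count changes."""
        key = hashlib.sha256(data).digest()
        if key in self.refcounts:
            self.refcounts[key] += 1
        else:
            self.refcounts[key] = 1
            self.blocks_written += 1
        return key  # stands in for the checksum kept in the block pointer

    def free(self, key: bytes) -> bool:
        """Drop one reference using the stored checksum; no data is re-read.

        Returns True only when the last reference is gone and the block
        itself can be released.
        """
        self.refcounts[key] -= 1
        if self.refcounts[key] == 0:
            del self.refcounts[key]
            return True
        return False


ddt = ToyDedupTable()
first = ddt.write(b"x" * 8192)
second = ddt.write(b"x" * 8192)   # duplicate: reference count goes to 2
released_early = ddt.free(first)  # one reference remains
released_last = ddt.free(second)  # last reference: block can be released
```

Note that `free` takes the checksum rather than the data, mirroring the point that the block pointer already holds the checksum of the block being freed.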
The dedup table itself is an object managed by ZFS at the pool level. The table is considered metadata and its elements will be stored in the ARC cache. Up to 25% of memory (zfs_arc_meta_limit) can be used to store metadata. When the dedup table actually fits in memory, enabling dedup is expected to have a rather small effect on performance. But when the table is many times greater than the allotted memory, the lookups necessary to complete the TXG can cause write throttling to be invoked earlier than for the same workload running without dedup. If using an L2ARC, the DDT represents prime objects to use the secondary cache. Note that, independent of the size of the dedup table, read-intensive workloads in highly duplicated environments are expected to be serviced using fewer IOPS at lower latency than without dedup. Also note that whole filesystem removal and large file truncation are operations that can free up large quantities of data at once. When the dedup table exceeds allotted memory, those operations, which are more complex with deduplication, can impact the amount of data going into every TXG and the write throttling behavior.
So how large is the dedup table ?
The command zdb -DD on a pool shows the size of DDT entries. In one of my experiments it reported about 200 bytes of core memory per table entry. If each unique object is associated with 200 bytes of memory, then 32GB of RAM could reference 20TB of unique data stored in 128K records, or more than 1TB of unique data in 8K records. So if there is a need to store more unique data than what these ratios provide, strongly consider allocating some large read-optimized SSD to hold the DDT. The DDT lookups are small random I/Os, which are handled very well by current-generation SSDs.
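The sizing estimate above can be reproduced as a quick calculation. The helper name is hypothetical, and the sketch assumes binary units (1 GB = 2^30 bytes, 1 TB = 2^40 bytes). Like the text, it assumes all 32GB is available for DDT entries, ignoring the metadata limit mentioned earlier.

```python
DDT_ENTRY_BYTES = 200  # per-entry memory cost reported in the experiment above

GIB = 2 ** 30
TIB = 2 ** 40


def unique_data_referenced(ram_bytes: int, record_bytes: int,
                           entry_bytes: int = DDT_ENTRY_BYTES) -> int:
    """Bytes of unique data whose DDT entries fit in the given amount of memory."""
    entries = ram_bytes // entry_bytes
    return entries * record_bytes


with_128k = unique_data_referenced(32 * GIB, 128 * 1024) / TIB  # about 20.5
with_8k = unique_data_referenced(32 * GIB, 8 * 1024) / TIB      # about 1.28
print(f"128K records: {with_128k:.2f} TB, 8K records: {with_8k:.2f} TB")
```

The 16x gap between the two results is simply the ratio of the record sizes: smaller records need proportionally more table entries for the same amount of data.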
The first motivation to enable dedup is actually when dealing with duplicate data to begin with. If possible, procedures that generate duplication could be reconsidered. The use of ZFS Clones is actually a much better way to generate logically duplicate data for multiple users in a way that does not require a dedup hash table.
But when the operating conditions do not allow the use of ZFS Clones and data is highly duplicated, then the ZFS deduplication capability is a great way to reduce the volume of stored data.
The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.
Referenced Links :
- ZFS Dedup
- Bobn's first Look
- Dedup Community
- Dynamics of ZFS
- Write Throttle
- Recordsize
- SHA256 Performance
-
Data highlights on solar energy
[Green, Social Entrepreneurship] (Grist - the Latest from Grist)
by Lester Brown
Concerns about global warming, rising fossil fuel prices, and oil insecurity have prompted calls for a new energy economy, one that replaces fossil fuels with renewables. The sun is an enormous reservoir of energy; in fact, the sunlight reaching Earth in just one hour is enough to power the global economy for a whole year. Harnessing some of this energy is an essential component of Earth Policy Institute’s carbon cutting plan, as presented in chapter 5 of Plan B 4.0. Here are some highlights from the accompanying data on three types of solar energy: solar photovoltaics (PVs), concentrated solar thermal power (CSP), and solar water and space heating.
Annual production of solar photovoltaics reached nearly 7,000 megawatts in 2008. Although this technology for converting sunlight into electricity was developed in the United States, Japan took an early lead in production, surpassed only in recent years by China and Germany. Chinese annual production skyrocketed from 40 megawatts in 2004 to 1,848 megawatts in 2008, nearly five times the output of the United States. Currently almost all of China’s production is for the export market, but several massive domestic installations are being planned.
Graph on Annual Solar Photovoltaics Production in Selected Countries, 1995-2008
At the end of 2008, the world had a cumulative total of 15,000 megawatts in PV installations. Though Germany is far from the world’s sunniest country, government policies have made it the global PV leader, with an installed capacity of 5,308 megawatts. Other countries with large solar installations are Spain with 3,223 megawatts, Japan with 2,149 megawatts, and the United States with 1,173 megawatts.
Rooftop solar water and space heaters that directly convert sunlight into heat have been embraced in a number of countries but nowhere as much as in China. With nearly 80,000 thermal megawatts of capacity (enough for 27 million homes), China accounts for two-thirds of the world’s 120,000 thermal megawatt capacity. Turkey comes in at a distant second with 7,100 thermal megawatts. In per capita terms, Cyprus and Israel lead the list with 0.9 and 0.7 square meters, respectively.
Graph on Solar Water and Space Heating Capacity in Top Countries, 2007
New solar thermal power projects, which use mirrors to concentrate sunlight on a liquid-filled vessel to produce steam that drives a turbine, are coming online again after a 16-year hiatus. Since 2006, world capacity has grown by over 450 megawatts to a total of 820 megawatts, enough to power 156,000 American homes for one year. Scores of new projects are in the pipeline. When those currently under construction are completed, the world CSP capacity will increase almost 4-fold. There are an even greater number of projects in the contract or development stages. In the United States alone, projects under development exceed 10,000 megawatts, 20-times greater than the combined capacities of plants currently in operation and under construction.
Graph on World Installed Concentrating Solar Thermal Power Capacity, 1980-March 2010
Avoiding dangerous climate destabilization requires a Plan B: reducing global net carbon dioxide emissions 80 percent by 2020. Achieving this goal requires a transition from fossil fuels to renewable energy from wind, solar, and geothermal sources. Current trajectories, national targets, and available resources indicate that the 100-fold increase for PV and solar rooftop heaters and the 200-fold increase for CSP, as called for in Plan B, are within reach.
You can download our datasets or read the book to learn more about solar power’s role in the plan to stabilize climate.
-
New and Exciting in PLoS ONE [A Blog Around The Clock]
[Science] (ScienceBlogs Channel : Life Science)
There are 35 new articles in PLoS ONE today. As always, you should rate the articles, post notes and comments and send trackbacks when you blog about the papers. You can now also easily place articles on various social services (CiteULike, Mendeley, Connotea, Stumbleupon, Facebook and Digg) with just one click. Here are my own picks for the week - you go and look for your own favourites:
Does Tropical Forest Fragmentation Increase Long-Term Variability of Butterfly Communities?:
Habitat fragmentation is a major driver of biodiversity loss. Yet, the overall effects of fragmentation on biodiversity may be obscured by differences in responses among species. These opposing responses to fragmentation may be manifest in higher variability in species richness and abundance (termed hyperdynamism), and in predictable changes in community composition. We tested whether forest fragmentation causes long-term hyperdynamism in butterfly communities, a taxon that naturally displays large variations in species richness and community composition. Using a dataset from an experimentally fragmented landscape in the central Amazon that spanned 11 years, we evaluated the effect of fragmentation on changes in species richness and community composition through time. Overall, adjusted species richness (adjusted for survey duration) did not differ between fragmented forest and intact forest. However, spatial and temporal variation of adjusted species richness was significantly higher in fragmented forests relative to intact forest. This variation was associated with changes in butterfly community composition, specifically lower proportions of understory shade species and higher proportions of edge species in fragmented forest. Analysis of rarefied species richness, estimated using indices of butterfly abundance, showed no differences between fragmented and intact forest plots in spatial or temporal variation. These results do not contradict the results from adjusted species richness, but rather suggest that higher variability in butterfly adjusted species richness may be explained by changes in butterfly abundance. Combined, these results indicate that butterfly communities in fragmented tropical forests are more variable than in intact forest, and that the natural variability of butterflies was not a buffer against the effects of fragmentation on community dynamics.
Large-Scale Movement and Reef Fidelity of Grey Reef Sharks:
Despite an Indo-Pacific wide distribution, the movement patterns of grey reef sharks (Carcharhinus amblyrhynchos) and fidelity to individual reef platforms has gone largely unstudied. Their wide distribution implies that some individuals have dispersed throughout tropical waters of the Indo-Pacific, but data on large-scale movements do not exist. We present data from nine C. amblyrhynchos monitored within the Great Barrier Reef and Coral Sea off the coast of Australia. Shark presence and movements were monitored via an array of acoustic receivers for a period of six months in 2008. During the course of this monitoring few individuals showed fidelity to an individual reef suggesting that current protective areas have limited utility for this species. One individual undertook a large-scale movement (134 km) between the Coral Sea and Great Barrier Reef, providing the first evidence of direct linkage of C. amblyrhynchos populations between these two regions. Results indicate limited reef fidelity and evidence of large-scale movements within northern Australian waters.
Evolutionary Divergence in Brain Size between Migratory and Resident Birds:
Despite important recent progress in our understanding of brain evolution, controversy remains regarding the evolutionary forces that have driven its enormous diversification in size. Here, we report that in passerine birds, migratory species tend to have brains that are substantially smaller (relative to body size) than those of resident species, confirming and generalizing previous studies. Phylogenetic reconstructions based on Bayesian Markov chain methods suggest an evolutionary scenario in which some large brained tropical passerines that invaded more seasonal regions evolved migratory behavior and migration itself selected for smaller brain size. Selection for smaller brains in migratory birds may arise from the energetic and developmental costs associated with a highly mobile life cycle, a possibility that is supported by a path analysis. Nevertheless, an important fraction (over 68%) of the correlation between brain mass and migratory distance comes from a direct effect of migration on brain size, perhaps reflecting costs associated with cognitive functions that have become less necessary in migratory species. Overall, our results highlight the importance of retrospective analyses in identifying selective pressures that have shaped brain evolution, and indicate that when it comes to the brain, larger is not always better.
How Accurate and Robust Are the Phylogenetic Estimates of Austronesian Language Relationships?:
We recently used computational phylogenetic methods on lexical data to test between two scenarios for the peopling of the Pacific. Our analyses of lexical data supported a pulse-pause scenario of Pacific settlement in which the Austronesian speakers originated in Taiwan around 5,200 years ago and rapidly spread through the Pacific in a series of expansion pulses and settlement pauses. We claimed that there was high congruence between traditional language subgroups and those observed in the language phylogenies, and that the estimated age of the Austronesian expansion at 5,200 years ago was consistent with the archaeological evidence. However, the congruence between the language phylogenies and the evidence from historical linguistics was not quantitatively assessed using tree comparison metrics. The robustness of the divergence time estimates to different calibration points was also not investigated exhaustively. Here we address these limitations by using a systematic tree comparison metric to calculate the similarity between the Bayesian phylogenetic trees and the subgroups proposed by historical linguistics, and by re-estimating the age of the Austronesian expansion using only the most robust calibrations. The results show that the Austronesian language phylogenies are highly congruent with the traditional subgroupings, and the date estimates are robust even when calculated using a restricted set of historical calibrations.
Background
A systematic review was conducted for the association between animal feeding operations (AFOs) and the health of individuals living near AFOs. The review was restricted to studies reporting respiratory, gastrointestinal and mental health outcomes in individuals living near AFOs in North America, European Union, United Kingdom, and Scandinavia. From June to September 2008 searches were conducted in PUBMED, CAB, Web-of-Science, and Agricola with no restrictions. Hand searching of narrative reviews was also used. Two reviewers independently evaluated the role of chance, confounding, information, selection and analytic bias on the study outcome. Nine relevant studies were identified. The studies were heterogeneous with respect to outcomes and exposures assessed. Few studies reported an association between surrogate clinical outcomes and AFO proximity. A negative association was reported when odor was the measure of exposure to AFOs and self-reported disease, the measure of outcome. There was evidence of an association between self-reported disease and proximity to AFO in individuals annoyed by AFO odor. There was inconsistent evidence of a weak association between self-reported disease in people with allergies or familial history of allergies. No consistent dose response relationship between exposure and disease was observable.
Human Mammary Epithelial Cells Exhibit a Bimodal Correlated Random Walk Pattern:
Organisms, at scales ranging from unicellular to mammals, have been known to exhibit foraging behavior described by random walks whose segments conform to Lévy or exponential distributions. For the first time, we present evidence that single cells (mammary epithelial cells) that exist in multi-cellular organisms (humans) follow a bimodal correlated random walk (BCRW). Cellular tracks of MCF-10A pBabe, neuN and neuT random migration on 2-D plastic substrates, analyzed using bimodal analysis, were found to reveal the BCRW pattern. We find two types of exponentially distributed correlated flights (corresponding to what we refer to as the directional and re-orientation phases) each having its own correlation between move step-lengths within flights. The exponential distribution of flight lengths was confirmed using different analysis methods (logarithmic binning with normalization, survival frequency plots and maximum likelihood estimation). Because of the presence of non-uniform turn angle distribution of move step-lengths within a flight and two different types of flights, we propose that the epithelial random walk is a BCRW comprising two alternating modes with varying degrees of correlation, rather than a simple persistent random walk. A BCRW model rather than a simple persistent random walk correctly matches the super-diffusivity in the cell migration paths as indicated by simulations based on the BCRW model.
Localization of Canine Brachycephaly Using an Across Breed Mapping Approach:
The domestic dog, Canis familiaris, exhibits profound phenotypic diversity and is an ideal model organism for the genetic dissection of simple and complex traits. However, some of the most interesting phenotypes are fixed in particular breeds and are therefore less tractable to genetic analysis using classical segregation-based mapping approaches. We implemented an across breed mapping approach using a moderately dense SNP array, a low number of animals and breeds carefully selected for the phenotypes of interest to identify genetic variants responsible for breed-defining characteristics. Using a modest number of affected (10-30) and control (20-60) samples from multiple breeds, the correct chromosomal assignment was identified in a proof of concept experiment using three previously defined loci; hyperuricosuria, white spotting and chondrodysplasia. Genome-wide association was performed in a similar manner for one of the most striking morphological traits in dogs: brachycephalic head type. Although candidate gene approaches based on comparable phenotypes in mice and humans have been utilized for this trait, the causative gene has remained elusive using this method. Samples from nine affected breeds and thirteen control breeds identified strong genome-wide associations for brachycephalic head type on Cfa 1. Two independent datasets identified the same genomic region. Levels of relative heterozygosity in the associated region indicate that it has been subjected to a selective sweep, consistent with it being a breed defining morphological characteristic. Genotyping additional dogs in the region confirmed the association. To date, the genetic structure of dog breeds has primarily been exploited for genome wide association for segregating traits. These results demonstrate that non-segregating traits under strong selection are equally tractable to genetic analysis using small sample numbers.
-
Accessibility and SharePoint 2010
[Windows] (MSDN Blogs)
This is Tim McConnell, Program Manager on the SharePoint Foundation team. For the 2010 release, I’ve worked with SharePoint platform and partner teams to deliver powerful, reliable, accessible user experiences. Like Office, Office Web Applications, Windows, and teams across Microsoft, everyone in SharePoint strives to remove barriers that make software difficult to use. Sometimes improvements can be obvious, like the reorganized Ribbon user interface. However, some users may not notice changes that can transform another user’s experience. Accessible software respects the range of different users’ experiences, and it accommodates everyone.
Standards
As a starting point, SharePoint adopted the Web Content Accessibility Guidelines 2.0, WCAG 2.0, and set a goal for Level AA. Becoming a W3C recommendation on December 11th, 2008, WCAG 2.0 defines the expectations of and the techniques deployed in well-built, accessible Web sites. The SharePoint teams followed the spec’s developments, and we designed and tested SharePoint 2010 against the guidelines. WCAG 2.0 represents a modern, international standard that’s as valuable to developers as it is to Web users.
Core Investments
The four principles of WCAG 2.0 are Perceivable, Operable, Understandable, and Robust. For each area, SharePoint has made key investments, and here I’ll scratch the surface to describe a few:
Perceivable
- SharePoint 2010 delivers broad changes to describe content and media and to explain controls.
- The redesigned masterpage leverages CSS and presents content in the appropriate sequence.
Operable
- Keyboard interaction has been a cornerstone in our feature evaluations to maximize device compatibility and usability.
- Proper heading structures have been added to pages for informational, organizational, and navigational benefits.
- Core to a trustworthy interface is a dependable focus, and we’ve invested heavily in protecting the user’s focus and in deferring control to the user agent wherever possible.
Understandable
- Across SharePoint, we’ve improved language support, and we’ve integrated this information into our pages and into our advanced editors.
- SharePoint supports browser settings to zoom content and operating system features to increase font sizes.
Robust
- Our new design efforts let us declare DocTypes and specify CSS-standards rendering for our masterpages. This has dramatically improved our cross-browser support.
- Broad investments were made to update our markup to be like well-formed XML, and the new rich text editor has clean markup and a function to convert its content into XHTML.
We’ve tested these principles with and without Assistive Technologies to verify their value for all users.
ARIA Integration
ARIA stands for Accessible Rich Internet Applications, and it specifies descriptive extensions for Web applications. Like WCAG, WAI-ARIA is from the W3C’s Web Accessibility Initiative. In a nutshell, ARIA allows an inaccessible element, such as a div with an onclick attribute, to surface itself as a button control. This can be done with a new role attribute set to “button”—it’s that simple. SharePoint leverages ARIA in the Ribbon, in dialogs, in our new rich text editor, and elsewhere in the platform and in partner applications.
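As a rough illustration of the div-with-onclick case described above, the following Python sketch scans markup for click handlers on non-native elements that declare no role. The class name, the list of native controls, and the sample markup are assumptions made for this example; none of it is SharePoint code.

```python
from html.parser import HTMLParser


class ClickableAudit(HTMLParser):
    """Collect elements that handle clicks but expose no ARIA role."""

    # Elements that already surface a control type on their own (assumed list).
    NATIVE = {"a", "button", "input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.missing_role = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if "onclick" in attr and tag not in self.NATIVE and "role" not in attr:
            self.missing_role.append((tag, attr.get("id")))


def audit(markup):
    """Return (tag, id) pairs for clickable elements lacking a role."""
    parser = ClickableAudit()
    parser.feed(markup)
    return parser.missing_role


# Hypothetical markup: the first div is opaque to assistive technology,
# the second one surfaces itself as a button.
sample = (
    '<div id="save" onclick="save()">Save</div>'
    '<div id="send" role="button" tabindex="0" onclick="send()">Send</div>'
)
print(audit(sample))
```

Only the first div is reported, because the second one declares role="button".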
Examples of Accessibility Investments
Dialogs
In order to keep users in context for as long as possible, we’ve introduced in-browser dialogs. With a dialog, the experience of reading, editing, and creating SharePoint content moves more quickly. Since SharePoint dialogs do not open new browser windows, we’ve built in important accessibility features to help all users navigate successfully.
- Focus: SharePoint describes its dialogs using multiple accessible techniques, and form dialogs will set focus on the first form element like they would after a navigation event.
- Dismissing a dialog: depending on how a browser implements Access Keys, closing the dialog is a couple of key strokes away. For example, in Internet Explorer, a user can hit Alt+C to disregard a dialog; in Firefox users can hit Alt+Shift+C.
- Confirming a dialog: when the necessary forms have been filled, users can hit Alt+O to accept the dialog or to submit the form.
The Ribbon
As the key component of the new SharePoint 2010 user interface, the Ribbon needs to deliver a powerful, useful, and usable experience. We designed the Ribbon to be accessible from the beginning, and we took advantage of multiple tools and techniques to provide a rich experience.
Keyboard Support
Keyboard support comes from the ground up. Because the Ribbon is a complicated component, it has a simple link to skip all of its commands. To help users on keyboards and alternative input devices, the Ribbon provides hidden, in-context instructions that explain its structure and how it’s controlled. Each of the Ribbon’s commands and menu anchors appear within the page’s navigation order, so it’s always safe to explore either forwards or backwards.
Tab Access
Because the Ribbon appears at the top of SharePoint pages, it’s necessary to provide quick access. The Ribbon operates as a central control for all of the components on the page, so it’s impractical to navigate back and forth for every command. To accelerate Ribbon interaction, a new shortcut key combination, Ctrl+[, will jump the focus to the first available Ribbon tab. From there, users can move back toward the Quick Access Toolbar commands and the Site Actions menu, or users can move ahead to the other Ribbon tabs.
For example, after entering the Ctrl+[ shortcut key combination, the Browse tab is highlighted to show that it has focus.
Command Access
Similar to accessing Tabs, it’s also important to quickly access commands. For this SharePoint supports the Ctrl+] shortcut key combination. This shortcut works in one of two ways:
- It selects the first command on the active Ribbon Tab.
- It selects the last used command on the active Ribbon Tab.
To move between Groups of Ribbon commands enter one of the Ctrl+Arrow Left, Ctrl+Arrow Right, Shift+Arrow Left, or Shift+Arrow Right shortcut key combinations. These shortcuts will loop through the Groups to prevent users from accidentally navigating outside of the Ribbon. The shared use of Ctrl and Shift allows for maximum browser and Assistive Technology compatibility.
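The looping behaviour described above amounts to wrap-around indexing. Here is a minimal Python sketch of that idea; the group names and the function are hypothetical and are not taken from the Ribbon's actual implementation.

```python
def next_group(current, count, step):
    """Return the index of the group reached by moving `step` groups.

    The modulo keeps the result inside 0..count-1, so moving past either
    end wraps around instead of leaving the control.
    """
    return (current + step) % count


groups = ["Clipboard", "Font", "Paragraph"]  # hypothetical group names

# Moving right from the last group wraps to the first one.
print(groups[next_group(2, len(groups), +1)])
# Moving left from the first group wraps to the last one.
print(groups[next_group(0, len(groups), -1)])
```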
Enhanced Tooltips
Enhanced tooltips describe a command’s behavior and its availability without cluttering the user interface or slowing navigation. When trying to decipher small icons or to move between many rich commands, enhanced tooltips provide the extra bit of information needed to verify your actions.
ARIA Integration
Behind the scenes in each of the three Ribbon examples are ARIA role attributes describing the structure and purpose of the Ribbon controls. Here’s a short list of attributes:
- aria-labelledby – Rich control labels
- aria-describedby – Rich control descriptions via enhanced tooltips
- aria-haspopup – Notification information to warn when a control may pop-up another control
- aria-multiline – Describes text fields for large amounts of content
And here’s a short list of ARIA roles used within SharePoint:
- tabpanel – An expanded Ribbon Tab
- tooltip – Ribbon tooltip content
- button – An interactive button control
- dialog – An interactive dialog
Each of these simple strings dramatically changes how browsers and Assistive Technologies communicate Web content to users. While a basic <a> anchor tag will work for most basic command scenarios, it’s better and more reassuring to fully provide ARIA’s role="button" syntax for clear descriptions.
InfoPath Forms
Through investments made in InfoPath Forms Services 2010, form designers can easily design and publish forms with an accessible user experience.
Assistive Technology Friendly
InfoPath forms have been designed and tested to work with browsers and assistive technologies. Broad changes have been made to describe simple controls and complex controls with field validation and relationships.
ARIA Integration
WAI-ARIA has been used to further improve the user experience on assistive technologies: ARIA is used to notify the assistive technology of form updates, alerts, warnings, and other pop-up dialogs.
Keyboard Support
Users filling forms in IPFS 2010 have full keyboard support to access all necessary functionality. InfoPath has also done work to ensure that keyboard focus is maintained in a predictable manner during dynamic changes to the form.
Project Grid Editing
The ability to display and edit tabular data is a core component of any productivity suite. SharePoint is no exception. In SharePoint Foundation 2010 we have introduced a new JavaScript based grid control that allows users to modify SharePoint Project Tasks Lists, change Project schedules, and edit Access databases. From the very early planning stages of developing this control we began to craft requirements to ensure the control was accessible. The control has complex requirements around the support of Gantt Charts and hierarchy (for Project Server) as well as very large datasets, macros and custom user validation (for Access Services). In order to ensure accessibility for these features we made use of ARIA and robust keyboard shortcuts.
ARIA Integration
Like the Ribbon, ARIA is used to achieve support for these complex requirements. Here are additional examples of how Project uses ARIA:
- aria-owns – enables the focus element to be set in an input element that maps to the entire control
- aria-activedescendant – enables the virtual focus element to map to a specific cell within the grid
- aria-multiselectable – indicates that multiple cell selections can be made
- aria-expanded – indicates the expand/collapse state within the hierarchy
- aria-busy – indicates if a row has not yet been downloaded from the server
Keyboard Navigation
SharePoint’s Grid control was designed to support keyboard navigation from day one. We know that when dealing with tabular data, whether datasheets, lists or projects, users often have many items to display on screen. Because of this we provide a simple link that allows users to skip the grid when moving through elements on a page. Additionally, the Grid supports many of the keyboard shortcuts you have come to expect in desktop applications. Cell navigation can be easily performed by using directional arrow keys as well as traditional tabbing. Moving up and down within the grid is easy with common shortcuts like Page Up/Page Down as well as support for Home/End and many others. Support is even present for complex selection and expanding dropdowns (Alt+Down). In Project Server the control supports changing Gantt chart zoom levels through a couple of keypresses (CTRL+* & CTRL+/), as well as expanding and collapsing hierarchy.
Conclusion
Thanks for learning more about the investments that we’ve made to make SharePoint an exceptional, versatile, and accessible web application and platform. Web technologies move quickly, and we’re always seeking new ways to present dynamic Web experiences that work for everyone. We’re proud of the richness that we’ve delivered, and we hope that you’ll discover SharePoint 2010 to be both powerful and usable.
-Tim McConnell
-
StorageMojo’s best paper of FAST ‘10
[Storage, Enterprise] (StorageMojo)
StorageMojo’s best paper of FAST ‘10 is Understanding Latent Sector Errors and How to Protect Against Them (pdf) by Bianca Schroeder, Sotirios Damouras, and Phillipa Gill, University of Toronto.
The paper builds on research and a dataset that StorageMojo reviewed 2 years ago in Latent sector errors in disk drives. That research analyzed the error logs of 50,000 NetApp arrays with 1.53 million enterprise and consumer disk drives.
Understanding
Understanding LSEs does a statistical deep dive on the disk LSE dataset and then evaluates scrubbing and intra-disk redundancy strategies against the field data.
Latent sector errors are important for 3 reasons:
- 1 LSE can cause a RAID reconstruction failure in a single parity RAID system (RAID 5).
- Ever-tinier disk storage geometries make LSEs more likely.
- The insidious failure mode: no detection until access is attempted.
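The first reason can be made concrete with a toy single-parity stripe. This Python sketch is an illustration only: the block contents, function names, and error handling are assumptions, and real arrays work on sectors and stripes in firmware rather than on byte strings.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(stripe, failed, unreadable=frozenset()):
    """Rebuild the block at index `failed` from the surviving blocks.

    `stripe` holds the data blocks followed by one parity block.
    `unreadable` marks surviving blocks hit by a latent sector error.
    """
    survivors = [i for i in range(len(stripe)) if i != failed]
    bad = [i for i in survivors if i in unreadable]
    if bad:
        # Single parity tolerates one missing block, not two.
        raise OSError(f"cannot rebuild: blocks {bad} are unreadable")
    return xor_blocks([stripe[i] for i in survivors])


data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]  # made-up block contents
stripe = data + [xor_blocks(data)]               # append the parity block

# Disk 1 fails and every survivor is readable: the rebuild succeeds.
assert rebuild(stripe, failed=1) == data[1]

# Disk 1 fails and disk 2 has a latent sector error: the rebuild fails.
try:
    rebuild(stripe, failed=1, unreadable={2})
except OSError as err:
    print(err)
```

The second call fails because the latent error is only discovered when the rebuild tries to read the block, which is the failure mode the third bullet describes.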
Schroeder et al. used a subset of the LSE dataset that included only drives that had LSEs. This covered 29,615 nearline (presumably SATA) drives and 17,513 enterprise drives that had been in the field at least 12 months.
LSE metrics
Some of the paper’s conclusions:
- For most drives almost all LSEs are a single error. Multiple contiguous logical block errors are less than 2.5% of all LSEs.
- If there is a 2nd error, most are within 100 sectors of the 1st error.
- Depending on the model, between 20% and 50% of errors are in the first 10% of the drive’s logical sector space. Some drives have a higher concentration of errors at the end of the drive as well.
- LSEs are highly concentrated in a few short time intervals, not randomly spread out over a drive’s life.
- It appears that events that are close in space are also close in time.
The rest of the paper
The paper also goes into 2 interesting topics – intra-disk redundancy and scrubbing strategies – that deserve posts of their own. For the latter the research found that changing the order in which sectors are scrubbed can improve mean time to error detection by 40% – with no increase in overhead or scrub frequency.
Conclusions
Key quote:
We observe that many of the statistical aspects of LSEs are well modeled by power-laws, including the length of error bursts (i.e. a series of contiguous sectors affected by LSEs), the number of good sectors that separate error bursts, and the number of LSEs observed per time. We find that these properties are poorly modeled by the most commonly used distributions, geometric and Poisson. Instead we observe that a Pareto distribution fits the data very well and report the parameters that provide the best fit. . . . We find no significant difference in the statistical properties of LSEs in nearline drives versus enterprise class drives.
[bolding added -ed.]
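To show what fitting a Pareto distribution can involve, here is a small Python sketch of the standard maximum-likelihood estimate of the Pareto shape parameter. It runs on synthetic samples: the shape value, sample size, and function name are assumptions for this example, not the paper's data or its reported parameters.

```python
import math
import random


def pareto_shape_mle(samples, x_min):
    """Maximum-likelihood estimate of the Pareto shape parameter.

    For samples x_i >= x_min the estimate is n / sum(ln(x_i / x_min)).
    """
    logs = [math.log(x / x_min) for x in samples if x >= x_min]
    return len(logs) / sum(logs)


random.seed(0)
true_alpha = 1.5  # arbitrary shape chosen for the synthetic data

# random.paretovariate draws from a Pareto distribution with x_min = 1.
samples = [random.paretovariate(true_alpha) for _ in range(20000)]

alpha_hat = pareto_shape_mle(samples, x_min=1.0)
print(f"true shape {true_alpha}, estimated shape {alpha_hat:.3f}")
```

With enough samples the estimate should land close to the shape used to generate them. A heavy tail of this kind means that long error bursts are far more likely than a geometric model would predict.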
The StorageMojo take
Disk-based storage arrays are facing a real challenge from flash and possibly PCM technology. Disks win the $/GB race, but piling double and triple parity on arrays increases costs and firmware complexity.
Understanding the nature of the enemy – in this case latent sector errors – helps array designers develop more reliable and cost-effective arrays. Yet one has to wonder if the RAID paradigm is reaching the end of the line.
Parallel and object-based systems from Isilon and Panasas, for example, are very fast at disk rebuilds because they can draw data from many disk drives in parallel – without the performance-killing overhead that RAID rebuilds impose.
But those are larger systems. Putting these techniques together may give us reliable and economical RAID 5 systems for the SMB market for another decade or more.
Courteous comments welcome, of course. I’ve done work for Isilon – who also advertises on StorageMojo – and Panasas. The official best paper of FAST ‘10 was quFiles which I blogged about last week.
If you spot a typo please let me know. Thanks!
Copyright © 2010 StorageMojo. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator, the site you are looking at is guilty of copyright infringement. Please contact legal@storagemojo.com so we can take legal action immediately.
-
Enterprise Drupal: Project Configuration Management and Release Management
[Content Management] (Oshyn Web Content Management)
All Drupal developers have experienced this nightmare. You have been creating a great project, and now it’s time to deploy it. No one wants to deal with it. No one wants to be responsible for doing it.
If you are building a simple site, and there’s only a one-time deployment, it’s not a big deal. You just have to follow the widely discussed and well-known simple rules.
The rules
- 1. Use SVN (or CVS) to keep your code updated among the developers
- 2. Deploy to production by backing up/restoring the database
- 3. Create scripts to make the deployment process automatic
When it’s about deploying code, it’s still easy if you followed the simple rules. You only have to update your SVN branch and you are done.
When does the nightmare begin?
The problem is deploying changes made in the CMS without wiping all the existing data. In other words, merging the database changes: the settings and content stored in the database.
Scenario 1: Big (or huge) Drupal projects
You have to deploy a project in multiple phases. So while the first phase is deployed and working in production, you have a bunch of developers working on the next phase (this means, implementing new code, creating new nodes, creating new views, editing existing ones, changing settings, users, permissions, etc.)
Let me illustrate the problem:
The first release of the site is deployed, the database is identical. The site is looking good and working great. Everyone is happy.
A couple of days later, on the production site:
A content editor creates a new node; Drupal assigns it a nid (let’s say 100)
The content editor edits an existing node and changes a node reference field, and guess what, it now points to the newly created node! (nid=100)
The site administrator changes some parameters in the configuration of the site (which ones? I don’t know)
The site administrator changes even more configuration in some views (let’s say changed items displayed per page, and activated some caching options)
New users visit the site and register. They’re assigned some roles and they start creating some content as well.
While on the development site:
A developer is creating a new node for a feature required for phase 2. This new node should persist in production in the next release. Drupal assigns it a nid (Yes, you are right, it’s nid 100 again!)
Another developer is creating new test content in order to test a new module he is developing. It’s working great; but wait a minute, all this new data should not be migrated to prod in the next release, right?
Another developer is creating new views, and changing some settings in the entire site. Which of these features should be migrated in the next release? I’m not sure, maybe most of them. Which ones? Hmm…. Again, I don’t know.
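The nid collision in this scenario can be reproduced with two counters that start from the same snapshot. This Python sketch is purely illustrative: the Site class, the node titles, and the starting counter of 100 are assumptions that mirror the example above, not Drupal code.

```python
class Site:
    """A toy site with an auto-increment counter for node ids."""

    def __init__(self, name, next_nid, nodes):
        self.name = name
        self.next_nid = next_nid
        self.nodes = dict(nodes)  # each site gets its own copy

    def create_node(self, title):
        nid = self.next_nid
        self.next_nid += 1
        self.nodes[nid] = title
        return nid


# Both environments start from the same database after the first release.
snapshot = {1: "Home", 2: "About"}
prod = Site("production", next_nid=100, nodes=snapshot)
dev = Site("development", next_nid=100, nodes=snapshot)

prod_nid = prod.create_node("Press release (content editor)")
dev_nid = dev.create_node("Phase 2 landing page (developer)")

# Same nid, different content: a naive merge would overwrite one of them.
collisions = {
    nid
    for nid in prod.nodes.keys() & dev.nodes.keys()
    if prod.nodes[nid] != dev.nodes[nid]
}
print(prod_nid, dev_nid, collisions)
```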
A few weeks later it’s time to deploy, and in the development department some questions come up:
How do we filter what to migrate and what not to migrate? Which content should we migrate?
Should we use a module like deploy? But it does not have a stable release! Will it work for us?
What about the settings? Should we go one by one reproducing them in the new site?
Which settings should we reproduce? How will we know that the settings entered by the site administrator in live will be preserved?
Should we copy all the permissions? Or roles? Or even maybe some users?
Why was Drupal designed this way? Why can’t we separate content and settings?
What will be the downtime for implementing this new release?
What happens with the nodes that have the same nid? What if they’re linked from somewhere that no existing module will migrate? (For example, they’re linked from a panel page)
No one knows the right answer. It’s just too complicated.
Before going to my proposed solution, let’s analyze another scenario.
Scenario 2: A preview environment is required
The client wants a preview environment before doing any change, and they want to deploy all the changes they make in preview, by just clicking a button.
No, I won’t illustrate this scenario. It is not the worst scenario, and if the site is not too complex, there could be some easy solutions.
I included this scenario in this post to contrast when you can and when you can’t use the existing Drupal deployment modules.
The proposed solutions
I will describe two possible solutions:
- 1. Starting from the end, for the scenario when a preview environment is required: use Deployment or Node Export module
- 2. For Big/Huge Drupal projects or when the first solution is not enough: Use DB synchronization software
A first solution: Use Deployment or Node Export modules
If the site is not complex, and you just want to deploy content, you could try one of these modules:
- 1. Deployment module
- 2. Node export module
The Deployment module could suit this scenario perfectly; you can see a screencast of how it works. The drawbacks of the Deploy module are described on its project page:
- - “Deploy is under active development. API and/or UI changes can and almost certainly will happen. Deploy for Drupal 7 will likely be a very extensive rewrite”
- - “The Drupal 5 version of Deploy is currently non-functioning and I do not plan to revive it”
- - There is no stable release yet
Using this module, you can let the content editor go to the site’s content page, export the nodes and then import them into the production site.
If you want to give your users even a better way to find the nodes to export, you could create a view, expose some filters and use the Views Bulk Operations module to add an Export button in the view.
When are these solutions not enough?
I’d say that you will run into problems when your site is complex. The examples that I can think of are ones that I’ve had in real projects, but I’m pretty sure that a lot of people have had more problems, beyond the ones I’m describing below.
1. You have rules that create related content.
The general rule could be that you want rules to be executed in both environments; since we’re using Drupal functions to create the content, the rules will be executed on both sides. This is great. (And this is maybe the reason why we want to use these modules.)
But there could be some cases when you don’t want that behavior. Most people will ask me... when? Well, these cases only appear in real-life projects, and we never read about them in the theory or see them in the demo screencasts.
You have the A and B content types. When the A node is created, you have to create also a B node, that has a node reference to the newly created A node. Great, this is easy to implement, you just have to create a rule.
So, the editor creates an A node in PREVIEW environment, the node B is created automatically by a rule. After creating the A node, the editor edits some of its fields, and edits the B node as well. Now they want it to be deployed in production.
They use deploy or node export and the nodes are in production now. The problem is that now in production you have one A node and two B nodes.
This specific problem could be solved by deactivating rules or modifying them (We could add PHP code to verify if the node exists before creating it).
Anyway I wanted to point out this problem as an example of what could happen.
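The duplicate B node can be reproduced with a toy import loop. This Python sketch is a simplified model, not the Deploy or Node Export API: the node dictionaries, the save_node function, and the rules_enabled flag are assumptions, and it shows the "deactivate the rule during import" option rather than the existence check.

```python
def save_node(store, node, rules_enabled=True):
    """Save a node; when enabled, the rule creates a B node for each new A node."""
    store.append(dict(node))
    if rules_enabled and node["type"] == "A":
        store.append({"type": "B", "ref": node["title"], "title": "auto-created B"})


def deploy(exported, rules_enabled):
    """Import the exported nodes into an empty toy production store."""
    production = []
    for node in exported:
        save_node(production, node, rules_enabled)
    return production


def count_b(nodes):
    return sum(1 for node in nodes if node["type"] == "B")


# What the editor exported from PREVIEW: the A node and the edited B node.
exported = [
    {"type": "A", "title": "a1"},
    {"type": "B", "ref": "a1", "title": "edited B"},
]

print(count_b(deploy(exported, rules_enabled=True)))   # rule fires again on import
print(count_b(deploy(exported, rules_enabled=False)))  # rule switched off for the import
```

With the rule active the import ends up with two B nodes. An existence check inside the rule can work as well, but whether it catches the duplicate depends on the order in which the nodes are imported.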
2. You also want to export settings, not only nodes
What if you want an editor to be able to change a lot of settings in preview, and not on production directly? These modules won’t help us to deploy those settings. Deploy module tries to overcome this limitation, but we know that only some basic settings of Drupal can be exported. If you use a lot of modules, all those settings will not be migrated.
3. You want to export views, panels, and others
None of them are supported yet by the modules mentioned above. You have to export them and import them manually (or create them again) in the other site.
The problem is even bigger if you are adding new node content to a panel page.
A second solution: DB Synchronization
I’d recommend this solution for the scenario of big Drupal projects, (and when the first proposed solution is not enough)
If you have been googling about how to deploy a Drupal site, I’m sure you found a lot of posts that can only give you advice. They will tell you to use SVN, to use deploy or node export (or a similar module), and to test the site a lot after a migration. But no one will tell you the solution. No one has created a tested solution that can always be used.
My intent in this post isn’t to describe a complete solution, but to describe an approach that worked for some Enterprise Drupal applications.
In simple words, the solution is: use DB comparison software instead of Drupal modules to perform the migration.
Why?
- - DB comparison tools are tested, and work fine. I don’t have to deal with bugs, (bugs in my application or bugs in deployment modules)
- - The DB always contains the latest data, so I don’t have to trust that everyone is recording a log to be reproduced in the other environment
- - The migration can be done running a script, and not following a lot of steps. This will reduce the downtime when deploying a new version.
- - The nids can collide; how do I migrate content? Other primary keys can collide as well.
- - How to track what to migrate and what not to migrate?
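At its core, a DB comparison tool compares two versions of a table row by row, keyed on the primary key. The Python sketch below illustrates that idea under simplifying assumptions: tables are plain dictionaries, and the function name and sample rows are invented for this example rather than taken from any particular tool.

```python
def diff_rows(before, after):
    """Compare two {primary_key: row} mappings of the same table.

    Returns the rows that were added, removed, and changed.
    """
    added = {key: after[key] for key in after.keys() - before.keys()}
    removed = {key: before[key] for key in before.keys() - after.keys()}
    changed = {
        key: (before[key], after[key])
        for key in before.keys() & after.keys()
        if before[key] != after[key]
    }
    return added, removed, changed


# Hypothetical "node" rows: the snapshot taken at release time versus
# the development database a few weeks later.
snapshot = {1: {"title": "Home"}, 2: {"title": "About"}}
development = {1: {"title": "Home"}, 2: {"title": "About us"}, 3: {"title": "Phase 2"}}

added, removed, changed = diff_rows(snapshot, development)
print(added)
print(removed)
print(changed)
```

The added and changed rows are the candidates for migration; deciding which of them should actually go to production is still a human decision.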
DB migration, how the process works
1. Make an initial release, with all the data
You make an initial release of all the data: content, users, roles and settings on the production site are the same as on the development site. The database structure is identical.
2. Keep a snapshot of the identical database
It’s a good idea to keep a snapshot of the initial database that you are deploying. Why? You will need it later. Keep reading.
3. Just after the initial migration, run a script in PRODUCTION so that the nids (and all the auto-generated numbers) start at a higher number.
We don’t want collisions of ids, right? The idea is to reserve a range of ids for use in development, while in production new ids start at a higher number, so the two environments draw their ids from separate ranges.
To change the ids that are going to be generated, the MySQL statement to use is:
ALTER TABLE node AUTO_INCREMENT = 1000;
This statement makes new nids in production start at 1000, which reserves the ids below 1000 for development.
Obviously “node” is not the only table that you want to deal with. But you don’t want to go to the database and search for which fields are auto-increment, do you?
You can find which tables have an auto-increment field by using the SHOW TABLES and SHOW COLUMNS FROM <table> WHERE extra like '%auto_increment%' statements.
Also, there are tables that you certainly don’t want to include. You don’t want to deploy the watchdog log, or the cache tables, right?
All this process could get complicated if you are using a lot of modules in your site, so I decided to put it all together in a PHP script. Note that I’m not running the alter scripts, but just printing them. I’d like to review the script before running it in the DB, right?
<?php
// my DB prefix
$prefix = 'cms_';
$not_update_tables = array(
'accesslog',
'batch',
'boost_cache',
'boost_cache_relationships',
'boost_crawler',
'ctools_css_cache',
'ctools_object_cache',
'devel_queries',
'devel_times',
'flood',
'location_search_work',
'node_comment_statistics',
'node_counter',
'search_dataset',
'search_index',
'search_node_links',
'search_total',
'sessions',
'views_object_cache',
'watchdog',
// ANY OTHER TABLE YOU DON'T WANT TO INCLUDE
// BECAUSE YOU WILL NEVER MIGRATE IT FROM DEV TO PROD
);
$query = "SHOW TABLES";
$table_result = db_query($query);
$fields_info = array();
while ($table = db_result($table_result)){
$query = "SHOW COLUMNS FROM %s where extra like '%auto_increment%';";
$result = db_query($query, $table);
$process = TRUE;
// check if we should not process this table
foreach ($not_update_tables as $not_update_table){
$not_update_table = $prefix . $not_update_table;
if ($table == $not_update_table){
$process = FALSE;
break;
}
}
if (!$process){
continue;
}
while ($field = db_fetch_array($result)){
$query = "SELECT max(" . $field['Field'] . ") FROM $table";
$max = db_result(db_query($query));
$field_info = array(
'table' => $table,
'field' => $field['Field'],
'max' => $max,
);
$already_inserted = FALSE;
foreach($fields_info as $values){
if ($values['table'] == $table && $values['field'] == $field['Field']){
$already_inserted = TRUE;
}
}
if (!$already_inserted){
$fields_info[] = $field_info;
}
}
}
print "<h1>Tables/fields to be updated</h1>";
print "<table><tr><td>Table</td><td>Field</td><td>Current Max value</td></tr>";
foreach($fields_info as $field){
print "<tr><td>" . $field['table'] . "</td><td>" . $field['field'] . "</td><td>" . $field['max'] . "</td></tr>";
}
print "</table>";
print "<h1>Queries</h1>";
foreach($fields_info as $field){
$table = $field['table'];
$max = $field['max'];
if ($max < 500){
$max = 1000;
}elseif($max < 5000){
$max = 10000;
}elseif($max < 20000){
$max = 50000;
}elseif($max < 50000){
$max = 100000;
}elseif($max < 100000){
$max = 150000;
}
print "ALTER TABLE $table AUTO_INCREMENT = $max;<br/>";
}
?>
IMPORTANT: (a.k.a. DANGER) this script may not be complete for your site. You have to review which tables you must exclude based on the project you are working on. You also have to review the code that generates the new "auto_increment" values; for example, a current maximum of 100000 or more is printed unchanged.
How did I determine which tables should be excluded? I had to go through all the tables in the project and exclude the ones that I'm completely sure we don't want to migrate between the environments.
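The tier logic at the end of the script can be isolated and checked on its own. What follows is a minimal sketch in Python rather than part of the original PHP script. It mirrors the same thresholds. The function name `next_auto_increment`, the `headroom` parameter and the fallback for maximums of 100000 or more are assumptions added for illustration, and the table names and maximums in the usage loop are made up.

```python
def next_auto_increment(current_max, headroom=50000):
    """Return the AUTO_INCREMENT value to set in PRODUCTION.

    Mirrors the tiers printed by the PHP script. The `headroom` fallback
    for large tables is an assumption; the original script leaves values
    of 100000 or more unchanged.
    """
    current_max = current_max or 0  # an empty table yields NULL/None
    tiers = [(500, 1000), (5000, 10000), (20000, 50000),
             (50000, 100000), (100000, 150000)]
    for upper_bound, new_start in tiers:
        if current_max < upper_bound:
            return new_start
    # Assumed fallback: jump to a multiple of `headroom` that lies
    # above the current maximum plus one full block of headroom.
    return ((current_max + headroom) // headroom + 1) * headroom


# Hypothetical tables and current maximum ids, for illustration only.
for table, current_max in [("node", 120), ("users", 7300), ("files", 260000)]:
    start = next_auto_increment(current_max)
    print(f"ALTER TABLE {table} AUTO_INCREMENT = {start};")
```

Whatever rule you choose, the point is that the new starting value must leave enough ids below it for everything that will be created in DEV before the next deployment.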
4. Phase 2 is ready, we want to deploy it
You have to use a DB comparison and synchronization tool in order to create a script to migrate the data from the DEV server to the PRODUCTION server.
The first recommended step is deleting all the test data in DEV, that way we won’t be passing it to PRODUCTION.
Do you remember we saved a snapshot of the database we deployed in the first release? It will be useful at this point. Restore it somewhere. Currently we have 3 databases:
4.1. Compare structure
The structure of the database has probably changed because you installed or uninstalled modules. The first thing to do is a structure comparison. Based on the comparison you can decide either (1) to install the modules in PROD manually, so the structure stays the same, or (2) to include the structure synchronization in the script that you will run later against the PRODUCTION database. The script generated could be something like:
/* you installed the new memberships module */
CREATE TABLE memberships (mid int NOT NULL PRIMARY KEY, description varchar(50), …);
4.2. Determine which changes were made in DEV and need to be migrated to PRODUCTION
We make a comparison between the DEVELOPMENT database and the INITIAL SNAPSHOT database. We exclude all the tables that we don't want to migrate (the same ones we talked about before: watchdog, cache, logs, etc.). The script generated could be something like:
INSERT INTO node (nid, type, title,…) VALUES (4, 'Page', 'A node created in DEV after the first release', …);
INSERT INTO node (nid, type, title,…) VALUES (5, 'Story', 'Yet another node created in DEV', …);
UPDATE variable SET value = 's:7:"garland";' WHERE name = 'theme_default';
UPDATE variable SET value = 's:27:"development-mail@mysite.com";' WHERE name = 'site_mail';
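To make the idea of this comparison concrete, here is a small sketch in Python of what a comparison tool does conceptually for a single table. It is not the tool's actual algorithm. The function `diff_rows` and the sample rows are hypothetical, and each table is assumed to be loaded into memory as a mapping from primary key to row.

```python
def diff_rows(snapshot, current):
    """Compare two {primary_key: row_dict} mappings for one table.

    Returns (inserted, updated, deleted) relative to the snapshot.
    """
    inserted = {pk: row for pk, row in current.items() if pk not in snapshot}
    updated = {pk: row for pk, row in current.items()
               if pk in snapshot and snapshot[pk] != row}
    deleted = [pk for pk in snapshot if pk not in current]
    return inserted, updated, deleted


# Hypothetical contents of the node table in the snapshot and in DEV.
snapshot_nodes = {1: {"type": "page", "title": "Home"},
                  2: {"type": "story", "title": "Launch"}}
dev_nodes = {1: {"type": "page", "title": "Home"},
             2: {"type": "story", "title": "Launch day"},
             4: {"type": "page", "title": "A node created in DEV"}}

inserted, updated, deleted = diff_rows(snapshot_nodes, dev_nodes)
print(sorted(inserted), sorted(updated), deleted)
```

Rows found only in DEV become INSERT statements, rows whose values differ become UPDATE statements, and rows missing from DEV become DELETE statements.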
4.3. Determine which changes were made in PRODUCTION and need to be kept
We need to make a comparison between the PRODUCTION database and the INITIAL SNAPSHOT database. We might want to compare only the settings tables. So we should exclude all the content_* tables and logs/caching tables. The script generated could be something like:
UPDATE variable SET value = 's:26:"production-mail@mysite.com";' WHERE name = 'site_mail';
We may want to remove some INSERTs and DELETEs from this script. Why? Because this script will be run against the production database itself, only to make sure that the production settings are kept and not overwritten by a similar UPDATE statement generated by the other script.
You could have a bunch of options that were changed, not only in the variable table. You will have to determine which ones need to be kept.
4.4. Put all the scripts together and test them
You have to put all the scripts together: first the structure changes, second the changes to be moved from DEV to PROD, and finally the UPDATE statements for the changes to be kept in PROD. In our sample, the script will look like this:
/* STRUCTURE CHANGES */
CREATE TABLE memberships (mid int NOT NULL PRIMARY KEY, description varchar(50), …);
/* DATA DEPLOYED FROM DEV */
INSERT INTO node (nid, type, title,…) VALUES (4, 'Page', 'A node created in DEV after the first release', …);
INSERT INTO node (nid, type, title,…) VALUES (5, 'Story', 'Yet another node created in DEV', …);
UPDATE variable SET value = 's:7:"garland";' WHERE name = 'theme_default';
UPDATE variable SET value = 's:27:"development-mail@mysite.com";' WHERE name = 'site_mail';
/* VALUES TO BE KEPT IN PRODUCTION */
UPDATE variable SET value = 's:26:"production-mail@mysite.com";' WHERE name = 'site_mail';
Make a backup of your latest PRODUCTION database and test the script on that database.
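The ordering described in this step can be sketched as a small helper. This is an illustration in Python, not part of the original workflow. The function name `build_deploy_script` and the sample statements are hypothetical, and the three inputs are assumed to be the SQL texts produced in steps 4.1 to 4.3.

```python
def build_deploy_script(structure_sql, dev_sql, prod_keep_sql):
    """Join the three generated scripts in the order they must run."""
    sections = [("STRUCTURE CHANGES", structure_sql),
                ("DATA DEPLOYED FROM DEV", dev_sql),
                ("VALUES TO BE KEPT IN PRODUCTION", prod_keep_sql)]
    parts = []
    for title, sql in sections:
        parts.append(f"/* {title} */")
        parts.append(sql.strip())
    return "\n".join(parts) + "\n"


# Hypothetical, simplified statements for illustration only.
script = build_deploy_script(
    "CREATE TABLE memberships (mid int NOT NULL PRIMARY KEY);",
    "UPDATE variable SET value = 'dev' WHERE name = 'site_mail';",
    "UPDATE variable SET value = 'prod' WHERE name = 'site_mail';",
)
print(script)
```

Because the production values come last, they overwrite any conflicting value that arrived with the DEV changes.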
4.5. Time to deploy
The code should already be updated using SVN. Now it's time to merge the database. I would recommend following these steps in PRODUCTION before running the script:
- 1. stop running cron
- 2. put the site in maintenance mode
- 3. make a backup of database
Downtime? Just a couple of minutes, and you can be sure that all the changes were migrated. Isn't this great?
What’s next?
There's a lot more to talk about; a blog post is not enough. There are still some questions that need to be answered. I'd like to describe a complete scenario and how to automate the whole process, but that will be another post.
-
Senior Health Data Analyst (Seattle - Capitol Hill)
[Jobs, Jobs (not Steve)] (craigslist | all jobs in seattle-tacoma)
If you're an energetic individual interested in working for an exciting, entrepreneurial company, read on!
OCS is a fast-paced, fun, flexible and progressive company located on Capitol Hill, and we're looking for sharp contributors. Since its creation, OCS has established leadership in the healthcare industry by empowering clients with state-of-the-art business intelligence tools and the largest, most accurate national data warehouse. Today, the goal is to continue to create and elevate platforms for easily disseminating transformative analytics to healthcare organizations.
We have an immediate opening for a Senior Health Data Analyst
Position Details:
This is a full time permanent position.
Compensation: We offer a competitive salary, DOE, + a generous benefits package.
Job Description: The Senior Health Data Analyst position performs custom analysis of health care data for both internal and external clients. The position addresses all aspects of report development for research projects and publication, and independently conducts consulting work with clients. In addition, the position may involve the design and specification of new products around health care information.
Position requirements:
The candidate should possess a full range of health data management, analysis and informatics skills, including but not limited to:
- Understanding of data specifications and documentation and how to work with software developers to execute specifications
- Strong communication and presentation skills, specifically the ability to clearly explain technical information
- Strong report design skills, including the ability to create readable and clearly documented tables and charts for internal and external communication
- Strong project management skills, including the ability to manage competing deadlines
- Ability to coordinate data resources across multiple client settings dealing with staff from all levels of the client organizations
- Prior clinical and/or healthcare outcomes/quality management experience
- Knowledge of and experience in using advanced statistical methods, especially to interpret complex healthcare data.
- Experience working with SQL and SAS (3+ years preferred), including use of macros, relational DB and iterative processing, statistical graphics, and efficient management of large datasets
- Microsoft SSIS and SSRS experience a bonus
- Experience working with claims and other individual or encounter level health care data
- Ability to independently develop and execute valid analytical approaches to answer specific research questions
Educational Requirements:
An advanced degree in public health, nursing, health informatics or a Bachelor's degree in a relevant area with a minimum of five years health care data analysis or related experience
About OCS: OCS is the leading post-acute healthcare information company, offering unparalleled insight into the multitude of factors that drive the success of a provider's business. We are the keepers of the nation's most comprehensive set of integrated measures spanning clinical outcomes, financial performance, resource utilization, patient satisfaction and operational indicators for Home Health, Hospice and Private Duty agencies.
Our proprietary data products, benchmark services and analysis capabilities provide healthcare organizations and industry leaders with the critical insight into improving quality of care for patients and enhancing the effectiveness of their organizations.
Through the efforts and dedication of the OCS team, we offer cutting edge products to our 2,000+ clients, we've earned exclusive endorsements from numerous state and national associations, and we've created successful strategic alliances with major companies across our industry. We've been awarded the Deloitte Fast 50 Technology Award (measuring the Northwest's fastest growing companies) for the last eight consecutive years. This work has resulted in OCS being the premier leader in the markets we serve.
At OCS we offer you more than just a good job, a paycheck and a great benefit package. OCS has built an environment that encourages creativity, innovation, collaboration and growth, balanced with a commitment to having fun, making OCS a place where nothing is routine. Helping our employees achieve and grow is essential. It's key to the world class service we deliver to our clients. At OCS we believe it's important to have a good time at work, even when work is hard.
For more information about OCS and to apply for this position, please visit our web site, www.ocsys.com. More jobs and job application instructions are within our Careers section.
Local Seattle-area applicants only - no phone calls please.
OCS is an equal opportunity employer.
-
Look at Data Like a Statistician, Minus the Ph. D [Statistics]
[Tech, Goodtweet (Twitter material), Hot Topics, Lifehacks] (Lifehacker)
Nathan Yau is a doctoral candidate in statistics, but the most valuable lessons he's learned in analyzing and working with data don't involve formal math. Here's how he suggests looking at lines, charts, and numbers to find interesting things.
Photo by net_efekt.
Yau lays out the skills and mindsets that have served him well in his studies and analysis. As he puts it, he can't shoot from the hip with questions about proper sampling size or rendering formal analysis, but he's learned what to look for when looking at data—something we all do regularly, whether in monthly budgets or spreadsheets at work.
Two of his suggestions:
See the Big Picture
... It's important not to get too caught up with individual data points or a tiny section in a really big dataset. We saw this in the recent recovery graph. Like some pointed out, if we took a step back and looked at a larger time frame, the Obama/Bush contrast doesn't look so shocking.
Ask Why
... This is the most important thing I've learned: always ask why. When you see a blip in a graph, you should wonder why it's there. If you find some correlation, you should think about whether or not it makes any sense. If it does make sense, then cool, but if not, dig deeper. Numbers are great, but you have to remember that when humans are involved, errors are always a possibility.
It's not a top 10 list or secret hacks, just smart advice, and worth looking back at when you're vexed by a hidden message beneath all the numbers and lines you see in any data set.
Think like a statistician – without the math [FlowingData]
-
Healthcare Business Analyst (Washington, DC)
[Jobs] (craigslist | all jobs in washington, DC)
ver ∙ si ∙ vo: [vûr-sē-vō]
-noun
A focused information technology and business process consulting firm committed to delivering measurable results and value to each of its clients.
See also: results-oriented, passionate, committed, objective, agile, competent, and resourceful.
www.versivo.com
Business Analyst
Position Summary
Versivo is searching for a Business Analyst to join our team! This position is ideal for a mid-level business analyst with experience in both Information Technology and Healthcare. The successful candidate will be a quick learner who can leverage his or her IT and Healthcare experience to make an immediate impact on Versivo's projects from day one. With this position located at a client site, Versivo expects the successful candidate to have interpersonal skills as impressive as his or her analytical skills! Versivo encourages all team members to grow within the company, making this position perfect for a motivated individual looking to grow their knowledge in IT and Healthcare.
Essential Duties and Responsibilities
▪ Analyze large amounts of healthcare claims data, track trends and report on those trends
▪ Maintain data integrity across disparate information systems and internal/public databases
▪ Utilize advanced Microsoft Excel and Access skills in order to quickly compare data sets
▪ Communicate with an array of audiences, including client-site staff, various payer contacts and high level management
▪ Collaborate with other Healthcare IT project teams
▪ Identify and track issues, resolving in a timely manner
▪ Manage self and prioritize tasks
Qualifications
Candidates must possess:
▪ A four year college degree, preferably in Healthcare, IT, or a related field
▪ Three to five years of experience in Healthcare or IT
▪ Ability to rapidly acquire new skills and master new subject matters
▪ Exceptional, polished interpersonal skills, a creative mind, and a can-do attitude
Preference will be given to candidates who have:
▪ Knowledge of ASC X12 EDI billing standards, HIPAA 4010 and 5010 implementation guidelines
▪ Experience analyzing and manipulating large datasets
▪ Demonstrated success at diagnosis, management, and resolution of complex analytical and technical issues
Most importantly, the candidate should have the passion to support Versivo's execution of its mission: to help our clients Compete. Advance. Thrive.
Work Environment
Professional office environment
How To Apply
If this opportunity to join the Versivo team excites you, please send your resume, cover letter, and salary requirements to resumes@versivo.com.
Versivo is an equal opportunity employer and does not discriminate in employment opportunities or practices based on race, color, religion, sex, national origin, age, or any other characteristic protected by law.
-
State-owned enterprises in China: How profitable are they?
[China] (China)
In my last blog post on Chinese State Owned Enterprises (SOE), I showed that although SOEs—enterprises with the state as their biggest share holder—only make up less than 5 percent of total enterprises in China, they control almost 1/3 of total enterprise assets due to their big sizes—on average, SOEs are about 14 times larger than their non-SOE peers. Now, I turn to another key question: How profitable are they?
Views on the profitability of Chinese SOEs are usually diverse. Some people believe they are not very efficient, lagging far behind their non-SOE counterparts, while others think they are gold mines, generating tremendous profits. In fact, there is much to be said on both sides. As I'll show you next, there is great variety among Chinese SOEs. Those in the sectors monopolized by the state generally have very good profitability, while those in the sectors with small entry barriers for non-SOEs generally record poor performance. Hence, both sides are partially right. To better understand the profitability of Chinese SOEs, one should dig deeper into the individual sectors.
At the outset, however, it is useful to look at the big picture, which is shown in figure 1. Due to data availability constraints, the analysis here focuses on industrial enterprises, using the industrial enterprise survey dataset constructed by the National Bureau of Statistics of China (NBS). [Enterprises covered by this dataset together produce almost 40 percent of GDP.] As shown in figure 1, at least before the subprime crisis, the profitability of industrial SOEs in China as a whole was largely in line with that of non-SOEs, only 1-2 percentage points lower in terms of return on assets (ROA). Other similar profitability indicators show roughly the same picture. However, the gap has quickly widened since the global economic crisis started to hit China in 2008.
Although the industrial SOEs as a whole show reasonable performance relative to non-SOEs, there is substantial variety at the sectorial level. SOEs together control 44 percent of total assets in industry, but across different industrial sectors this share varies significantly. Moreover, there is a positive correlation between the performance of SOEs and their shares in individual sectors (i.e., sectors with bigger SOE shares generally see higher profitability of the SOEs within them), consistent with the conventional wisdom that SOEs profit from their monopoly powers. To elaborate on this sectorial pattern, industry is divided into 40 different sectors. The share of SOEs in a sector is defined as the proportion of total sectorial assets that they control (other metrics such as share of output show similar patterns). The share of SOEs is plotted on the horizontal axes of figure 2 and figure 3. The vertical axes of these two figures represent return on assets (ROA) of SOEs (figure 2) and non-SOEs (figure 3), while the bubble size shows total profits of SOEs/non-SOEs in individual sectors. Data from 2007, the last year before the subprime crisis, is used in these two figures to avoid the distortion of the global economic crisis.
Several interesting observations can be easily made with figure 2 and figure 3.
1. Shares of SOEs in different sectors are diverse. The spectrum extends from 1.1 percent in leather and fur production, a light industrial sector completely dominated by non-SOEs, to 99.3 percent in tobacco processing which is literally monopolized by the state. Generally speaking, heavy industrial sectors that play important roles in the economy have higher shares of SOEs, reflecting government’s policy of controlling the commanding heights of the national economy.
2. Profitability of SOEs is positively correlated with the share of SOEs in sectors. Such correlation is insignificant among non-SOEs. In 2007, about 19 percent of the cross-sector ROA variation of SOEs can be explained by differences in the share of SOEs across sectors. Moreover, some state-dominated sectors such as utility supply are subject to low administrative pricing, leading to weak profitability of SOEs in these sectors. If these sectors are excluded from the analysis, the explanatory power of SOE shares on SOE profitability could be even higher. Meanwhile, there is no meaningful correlation (correlation coefficient less than 0.04) between non-SOE ROA and shares of SOEs in sectors.
3. Most SOE profits are contributed by sectors highly monopolized by the state, while sectors dominated by non-SOEs are major sources of non-SOE profits. In figure 2, the bubble size (representing profits in an individual sector) increases almost uniformly along the horizontal axis. SOEs in the three sectors where the state controls over 90 percent of sectorial assets (tobacco processing, oil extraction, and electricity supply, which are positioned at the right-hand side of the horizontal axis) together contribute over 60 percent of total SOE profits in industry. Meanwhile, assets of SOEs in these 3 sectors account for less than 40 percent of total industrial SOE assets. This trend is almost reversed in figure 3, as most big bubbles stay at the left-hand side of the horizontal axis there.
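Observation 2 mixes two kinds of figures, a share of variance explained (19 percent) and a correlation coefficient (less than 0.04), so it helps to put them on the same scale. The sketch below assumes the 19 percent comes from a one-variable linear fit, in which case the share of variance explained equals the square of the Pearson correlation. The sector values used to exercise `pearson_r` are invented for illustration and are not taken from the NBS dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Figures quoted in the text.
implied_r_soe = math.sqrt(0.19)   # correlation implied by 19% explained
max_r2_non_soe = 0.04 ** 2        # variance share implied by r < 0.04

# Hypothetical sector data, for illustration only (not from the NBS dataset).
soe_share = [0.05, 0.20, 0.45, 0.70, 0.95]
soe_roa = [0.03, 0.02, 0.06, 0.04, 0.09]
r = pearson_r(soe_share, soe_roa)
print(f"r = {r:.2f}, share of variance explained = {r * r:.0%}")
```

Under that assumption, the SOE figure corresponds to a correlation of roughly 0.44, while the non-SOE correlation of under 0.04 would explain less than 0.2 percent of the variation.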
Inspired by observation 3, I calculated the ROA of SOEs in those 3 highly state-monopolized sectors (tobacco, oil extraction, and electricity). The ROA of all SOEs in other sectors is also calculated as a reference. It turns out that SOEs in these 3 sectors have performed very well during the last decade, with their ROA outpacing that of non-SOEs most of the time (figure 4). However, the ROA of SOEs in other sectors lagged far behind, by 4 percentage points on average.
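The shares quoted in observation 3 already imply a sizeable ROA gap, before looking at figure 4. As a rough check, assume a group's ROA is its total profits divided by its total assets. The function name `roa_ratio` is hypothetical, and the inputs are the boundary values from the text rather than exact data.

```python
def roa_ratio(profit_share, asset_share):
    """Ratio of ROA in a group of sectors to ROA in all remaining sectors.

    profit_share and asset_share are the group's fractions of total SOE
    profits and total SOE assets (both strictly between 0 and 1).
    """
    # Both values are expressed relative to the ROA of all SOEs combined.
    roa_group = profit_share / asset_share
    roa_rest = (1 - profit_share) / (1 - asset_share)
    return roa_group / roa_rest


# Boundary values from the text: over 60% of profits, under 40% of assets.
print(round(roa_ratio(0.60, 0.40), 2))
```

At exactly 60 percent of profits and 40 percent of assets the ratio is about 2.25. Since the actual profit share is higher and the asset share lower, SOEs in the three monopolized sectors earn more than 2.25 times the ROA of SOEs in the remaining sectors.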
Now we can answer the question asked earlier: How profitable are Chinese SOEs? For SOEs in industry, we have a clear answer. On average, the profitability of industrial SOEs overall is roughly comparable to that of their non-SOE peers. However, most SOE profits are contributed by highly state-monopolized sectors, in which SOEs record a respectable rate of return. Profitability of SOEs in sectors with less or little state domination is generally much poorer.
Before ending this post, a kind reminder must be made to my readers. As the data used here comes from industry only, SOEs in the service sectors (financial, transport, telecom, etc.) are absent in the analysis. How the result will change if these SOEs are included remains unclear so far.
Clearly, size and profitability are only two dimensions of SOE attributes. There are more interesting questions left out there: Are SOEs overly leveraged due to their close relationship with the state-owned banks? Do they particularly suffer from over-investment? What role do SOEs play in China's recent asset price inflation? These will be topics of my future blog posts on Chinese SOEs.
-
DATA ENGINEER | YouNoodle (downtown / civic / van ness)
[Jobs] (craigslist | all jobs in SF bay area)
YouNoodle is a 20-person technology startup building algorithms to track the world of emerging technology and gather intelligence on private companies. We have raised venture capital from Peter Thiel and the Founder's Fund. Our offices are located in San Francisco. We are a fun, exciting bunch of people up for tackling large problems. Please read below, and if this sounds like you, get in touch right away!
YOUR ATTITUDE
Data geek. Be excited about opportunities to discover, collect, and structure some of the world's most important information.
Detail oriented. You are pedantic about details and rigorous in your pursuit of perfection with a clear eye for process.
Technical whiz. You understand the different ways data can be stored, accessed, and analyzed. And you have the scripting skills to use it for creative and innovative products.
Experience. You have experience working with large datasets and text mining.
Naturally curious. You want to know everything there is to know about new and innovative companies.
YOU WILL BE
Confident in designing, building, managing, and owning a growing and complex database. Your past experience should reflect success with something similar.
Taking large heterogeneous datasets to deliver clean, synthesized streams of data into the world's largest database of early stage private companies.
Extracting insights from datasets using statistical / machine learning tools such as R, Weka, or using parallel transformation and processing techniques such as Map-Reduce.
Combining multiple sources of data from APIs, RSS feeds, and human input into a single, structured data store while dealing with issues of duplication, cross-referencing, relational structure, and non-relational/denormalized storage.
Developing processes to monitor and ensure data integrity.
Capturing data from human analysts and implementing QA processes.
Working with cross-functional teams, analysts, mathematicians, and engineers on projects including text mining, structuring data, and extracting entities.
Excited to find creative new ways to gain insights and information from a range of sources.
TECHNOLOGIES YOU KNOW AND LOVE
REST, XML-RPC, SOAP APIs.
Dynamic languages, such as PHP, Ruby, Python, R.
MySQL/PostgreSQL, document-based stores such as MongoDB or CouchDB, and key-value stores (Memcached, Redis); their distinct advantages, disadvantages, and trade-offs
Column-oriented stores (HBase, Cassandra), graph-oriented stores.
Please note, only candidates authorized to work in the United States will be considered.
-
Human microbiome and personalized medicine
[Future] (Broader Perspective)
In genomics, the eleventh annual meeting of Advances in Genome Biology and Technology (AGBT) was held February 24-27, 2010, and featured an eclectic mix of new research and bioinformatics tools. Genomic research was presented in a diversity of areas including human, animal, plant, and bacteria. Many research advances are coming from partnerships between one or more academic research teams together with commercial entities. The biggest buzz was around Pacific Biosciences, the 3rd generation sequencing darling, with their single-molecule real-time (SMRT) platform which is still on track for an estimated launch later this year. The platform could deliver a 30,000-fold improvement over current methods, and ultimately achieve sub-$100 whole human genome sequencing. Attendees were also wowed by 454 Roche's bench top GS Junior System (initially announced in late 2009), making sequencing much quicker and easier, and priced at only $98,000 (a milestone for sequencing equipment, which usually runs in the hundreds of thousands of dollars).
Sequencing data storage and transfer costs continue to increase with the computing industry still not cognizant of the whole new era of data processing and communications transfer that is necessary for Very Large Datasets. The NIH 1000 Genomes project, for example, is transferring many terabyte-sized files per day.
From a research standpoint, some of the greatest activity is in cancer genomics. A recent NIH study generated 100TB of data sequencing a melanoma sample and a normal blood sample and has been refining the Most Probable Variant (MPV) Bayesian analysis method used to identify genetic mutations. Perhaps the most innovative new research activity is in RNA sequencing. Other specific findings of note are in the areas of the microbiome and genetic variation:
Human microbiome
The complex interactions between individual humans and their microbiomes could have a substantial impact on personalized medicine. In some cases of infectious disease in humans, the pathogenesis may be unknown 40-60% of the time (e.g., respiratory disease, skin disease). Even rudimentary issues remain unsolved; for example, a simple blood draw showing staph infection may not reveal whether the bacteria were on the skin surface or in the blood. Microbiome sequencing is allowing the identification of novel pathogens, and could also be useful at the human population level to assess the spread and mutation trajectory of pathogens.
Genetic variation: human and otherwise
The populations analyzed in human genome wide association studies are being expanded, with important findings for both ancestry reconstruction and medical genomics. Research was presented on African-American, Mexican-American, Bushmen, and Bantu genome studies. A deeper understanding of genetic variation is also being used to facilitate the selection of desirable qualities in agriculture and animal livestock. For example, a chicken sequencing project found 7 million unique SNPs, 5 million of which were novel, and several of which were useful in translational application.
-
Senior Marketing & SEM Manager (Seattle )
[Jobs] (craigslist | all jobs in seattle-tacoma)
The Sr. Marketing Manager is responsible for driving nearly half of WhitePages traffic (tens of millions of visits per month) through Search Engine Marketing (Search Engine Optimization and Pay Per Click Advertising) and other online advertising. The Sr. Marketing Manager oversees a PPC Marketing specialist and a multi-million dollar advertising budget. You don't have to convince us that SEO/SEM is important: we absolutely get that & you'll have the dev/design team and budget to execute.
This is not your average SEM shop. We've been there, done that, and for the most part, we've nailed the traditional SEO best practices. We're driving tens of millions of monthly visits based on a long-tail program that has tens of millions of indexed pages. We need someone up to a greater challenge: we need someone who can make a top-tier program even better. We've got a massive amount of content, a great brand (and in-links), and a kick-%$#@ dev and product team who can move aggressively after the right opportunities. We need someone who can identify those opportunities and define the vision to accomplish them fast.
This person directly contributes to revenue by attracting new customers for people search and business / local search (web and mobile) while maximizing advertising spend ROI. S/he will be maniacally focused on data to drive traffic, conversion and ROI. S/he also acts as a business owner to identify, execute, and measure improvements to our online marketing programs, with particular emphasis on SEO and PPC. The Sr. Marketing Manager will act as Product Owner for agile sprints relating to SEO product development, including driving the business case, requirements, day to day trade-offs with development and design, and go-to-market planning and execution. The Sr. Marketing Manager will ensure that our SEO programs not only dramatically grow traffic, but also reflect the highest standards for fantastically simple customer experiences, brand alignment, and engineering design (including page performance).
Job Responsibilities
Responsible for all aspects of acquiring new users to WhitePages web properties (including WhitePages.com, 411.com, and a number of smaller portfolio sites) through online media management including pay-per-click (PPC), search engine optimization (SEO), and display advertising. This includes planning, analysis, campaign optimization, and reporting.
Acts as resident SEO expert, keeping up-to-date on best practices and key changes in search engine methodology, and regularly educating WhitePages employees.
Actively tests new advertising concepts, including PPC copy, landing pages, online advertising creative, etc.
Product owner on agile development team for SEM related sprints. Responsible for SEO product backlog and prioritization, including business case development and requirements/user stories.
Oversee the Analysis function for Marketing. Includes:
Establish social media channels in which to promote awareness of WhitePages and drive increased SEO benefit
Understand, support, and exemplify WhitePages values, mission, vision and brand pillars through actions and behaviors.
Management Responsibilities
· Responsible for interviewing and hiring within department.
· Responsible for timely and thorough completion of PDPs and Performance Reviews of direct reports.
· Participate in salary reviews and adjustments.
· Responsible for coaching and mentoring of employees through PDP process and through daily interaction.
· Responsible for implementing programs to promote a strong and cohesive team environment.
· Understands company goals & direction and how that translates into department goals. Communicates department goals to direct reports and assists individuals in aligning personal goals to department goals.
· Responsible for addressing performance issues in an effective and timely manner
Technical knowledge, skills and abilities
· Bachelor's degree required
· Minimum 5 years' experience with mass-market consumer web traffic and customer acquisition marketing with a focus on PPC and SEO
· Deep technical understanding of the mechanics of SEO. Proficient in HTML and web technologies.
· 2+ years' experience driving and growing a large-scale SEO program (minimum 1MM indexed pages)
· Experience working with and analyzing very large datasets and log files to unearth SEO opportunities and evaluate bot traffic in order to adjust tactics
· 3+ years' data analytics experience. Highly skilled in Excel for modeling, budgeting, and reporting purposes. Omniture experience a plus.
· Experience with live site testing (Google Web Optimizer a plus) to improve conversion rates and click throughs
· Demonstrated advanced communication and presentation skills, both verbal and written
· Expertise in MS Office including Excel, Word, Outlook, and PowerPoint
Company Description
What We Do:
WhitePages is your go-to source for the most reliable contact information online. With accurate information for almost 200 million U.S. adults, we make finding and connecting with others incredibly simple and incredibly free.
Our web properties are consistently top-ranked and we power the people search sections for several of the largest Internet properties including MSN, AOL, Superpages.com and even the official United States Postal Service Website. Driven by a spirit of innovation and a commitment to delivering consumer value, WhitePages is a highly profitable company and has been since incorporation in 2000. Help us break new ground in People Search as we execute on ambitious plans in 2010 and beyond!
Why You Want To Work Here:
A profitable, diverse and rapidly growing company, we were ranked a best place to work by Washington CEO Magazine, Seattle Business Monthly and Seattle Metropolitan Magazine. We were also recognized by Inc. Magazine as one of the Fastest Growing Private Companies in North America. Our headquarters is located in the heart of downtown Seattle surrounded by great restaurants and interesting places! Bottom line: we're a small, highly profitable company and our people are smart, passionate, entrepreneurially spirited team players with high integrity who like a challenge and have fun doing it!
We're always looking for bright, ambitious, talented people who share our values.
* We're on a mission. We're passionate about this once-in-a-lifetime chance to revolutionize how people connect.
* Work with the best people. We have big things to accomplish. To succeed we must hire, develop, and retain the best people.
* One mission, one team. We encourage passionate debate and then unite in execution.
* Work and play hard. We take our work seriously and ourselves lightly. We love what we do and have fun along the way.
* We love to win and really, really hate to lose. We set aggressive, obtainable goals and hold ourselves accountable for achieving them.
* We're entrepreneurial. We think big and act small. We see the big picture, yet act frugally and quickly. We're smart risk-takers and we loathe bureaucracy.
WhitePages is an equal opportunity employer! -
Academic attempts to take the hot air out of climate science debate | Leo Hickman
[Guardian] (Environment news, comment and analysis from the Guardian | guardian.co.uk)
Judith Curry aims to turn inflammatory debate of 'climategate' into reasoned online discussions to rebuild trust with the public
Professor Judith Curry, who currently chairs the Georgia Institute of Technology's School of Earth and Atmospheric Sciences, has embarked on what she's describing as a "blogospheric experiment". Having written a lengthy essay entitled Losing the Public's Trust which will be published later today, she decided to alert many bloggers across the climate change debate in "the hope of demonstrating the collective power of the blogosphere to generate ideas and debate them". She has asked the likes of Anthony Watts, Andrew Revkin, Roger Pielke Jr, among many others, to pitch in with their own thoughts about her essay with the goal of "bringing some sanity to this whole situation surrounding the politicization of climate science and rebuilding trust with the public". I genuinely hope she achieves her aims.
As and when other bloggers publish their own responses I will try and provide links to them below, but here are my own thoughts on Curry's article. First, I agree with her opening premise that "credibility is a combination of expertise and trust" and that the climate research establishment has failed to understand that the "climategate" furore is "primarily a crisis of trust".
In their misguided war against the skeptics, the CRU emails reveal that core research values became compromised. Much has been said about the role of the highly politicized environment in providing an extremely difficult environment in which to conduct science that produces a lot of stress for the scientists. There is no question that this environment is not conducive to science and scientists need more support from their institutions in dealing with it. However, there is nothing in this crazy environment that is worth sacrificing your personal or professional integrity. And when your science receives this kind of attention, it means that the science is really important to the public. Therefore scientists need to do everything possible to make sure that they effectively communicate uncertainty, risk, probability and complexity, and provide a context that includes alternative and competing scientific viewpoints. This is an important responsibility that individual scientists and particularly the institutions need to take very seriously.
If the "climate research establishment" is to take away one lesson from this sorry episode it will surely be the need to "effectively communicate uncertainty, risk, probability and complexity, and provide a context that includes alternative and competing scientific viewpoints".
Up to this point I strongly agree with Curry's sentiments, but I think she is a little complacent in her assessment of the "changing nature of scepticism about global warming". She correctly identifies that climate scepticism is a multi-headed and ever-shifting beast. There are as many flavours to the sceptics as there are to environmentalists. To label them all as flat-earthers and big oil deniers is just as ill-judged and lacking in subtlety as labelling all environmentalists as "eco-Nazis intent on taking us all back to the caves". Genuine climate science sceptics such as Climate Audit's Steven McIntyre are a world apart from the out-and-out denial pumped out by the likes of Prison Planet's Alex Jones. Somewhere in between are the likes of Anthony Watts who risks polluting his legitimate scepticism about the scientific processes and methodologies underpinning climate science with his accompanying politicised commentary. But Curry bags them up together and describes Watts and McIntyre both as "climate auditors":
They are technically educated people, mostly outside of academia. Several individuals have developed substantial expertise in aspects of climate science, although they mainly audit rather than produce original scientific research. They tend to be watchdogs rather than deniers; many of them classify themselves as "lukewarmers". They are independent of oil industry influence. They have found a collective voice in the blogosphere and their posts are often picked up by the mainstream media. They are demanding greater accountability and transparency of climate research and assessment reports… So how did this group of bloggers succeed in bringing the climate establishment to its knees (whether or not the climate establishment realizes yet that this has happened)? Again, trust plays a big role; it was pretty easy to follow the money trail associated with the "denial machine". On the other hand, the climate auditors have no apparent political agenda, are doing this work for free, and have been playing a watchdog role, which has engendered the trust of a large segment of the population.
I think Curry has misjudged this point a tad. If the "climate auditors" were exactly as billed above I would agree they are a most welcome addition to the debate. But to claim these blogs have no political agenda is naïve, I feel. Granted, both McIntyre and Watts do make regular efforts to tone down some of the very worst off-topic comments that follow their posts, but it doesn't take much analysis to know where the political heartbeat of these blogs lies. For right or wrong, they have attracted a particular crowd of followers – predominantly right-wingers in favour of the free-market and libertarianism – and it must be a difficult horse for McIntyre and Watts to ride at times without playing to the crowd.
Curry goes on to say:
There is a large group of educated and evidence driven people (eg, the libertarians, people that read the technical skeptic blogs, not to mention policy makers) who want to understand the risk and uncertainties associated with climate change, without being told what kinds of policies they should be supporting.
I think this is an important point. Some sceptics such as Bjørn Lomborg and Nigel Lawson have made a very conscious shift in their stance in recent years away from one that questioned the science to one that now largely focuses on questioning the policy responses to climate change. If we are to have a fierce, politicised debate let it lie here, surely. But let's keep the politics out of both the climate science and those that choose to try and audit it via their blogs.
And it is on this point that I think Curry makes her most powerful point:
While the blogosphere has a "wild west" aspect to it, I have certainly learned a lot by participating in the blogospheric debate including how to sharpen my thinking and improve the rhetoric of my arguments. Additional scientific voices entering the public debate particularly in the blogosphere would help in the broader communication efforts and in rebuilding trust. And we need to acknowledge the emerging auditing and open source movements in the internet-enabled world, and put them to productive use. The openness and democratization of knowledge enabled by the internet can be a tremendous tool for building public understanding of climate science and also trust in climate research.
I, too, think it would be a grave mistake not to make better use of the obvious open-source and crowd-source advantages enabled by blogs such as Climate Audit. Just as the SETI@Home project has made use of thousands of otherwise idle computers to scan radio telescope data for signs of extraterrestrial life, if people are willing and able to interrogate climate datasets in their spare time it would be strange in my view not to try and make use of this collective resource.
But the key for me is that word "trust" again. I think until those that frequent these sites come out from behind the cloak of anonymity that most of them choose to hide behind, very few people, particularly climate scientists, will be willing to trust the motives of this army of DIY auditors. Anonymity allows for some spicy free speech beneath blogs such as this one, but it is not the right tool if we're seeking the "openness and democratization of knowledge". If we are to once again try and drive a wedge between science and politics, then all the participating actors – on both sides of the debate – need to be open about who they are and where their motives and vested interests, if any, lie.
guardian.co.uk © Guardian News & Media Limited 2010 | Use of this content is subject to our Terms & Conditions -
Rise in deaths linked to obesity
[England] (Health News from NHS Choices)
There has been a “dramatic rise” in obesity-related deaths according to the BBC. The news is based on a study looking at 27 years’ worth of death data, focusing on whether obesity was listed as the main cause of death or only a contributing one. The researchers warn that obesity-related death may be more common than believed because it is rarely listed as the main cause of death.
While this study was actually examining the process of recording deaths, it highlights the important association between obesity and health - the researchers say that one possible reason for the increase in obesity-related death recordings is due to a real increase in obesity prevalence. The findings of this research will be important to public health practitioners or researchers who use death records to monitor obesity-related deaths.
Where did the story come from?
The study was carried out by Dr Marie Duncan and colleagues from the University of Oxford’s Department of Public Health and the National Obesity Observatory. The study was funded by the English National Institute for Health Research via its National Coordinating Centre for Research Capacity Development. The research was published in the European Journal of Public Health.
This time series study has demonstrated increased certification of obesity as a cause of death in England, although it is usually selected as a contributing cause rather than an underlying cause.
What kind of research was this?
A number of health risks are associated with obesity and the condition leads to an increase in overall mortality. This was a time series study analysing the changing trends in obesity-related deaths over time. According to its authors, figures from the 2007 Health Survey for England show that 24% of men and 25% of women are classified as obese.
What did the research involve?
The researchers used two separate datasets to investigate the trends in obesity-related mortality - the Oxford record linkage study (1979-2006) and English national mortality data (1995-2006). The researchers say the Oxford study is considered to be the “longest continuous run of systematic, ready-to-analyse, coding of all mentions on death certificates in a large defined population in England”. The English national mortality dataset also provides all certified causes of an individual’s death, not just the underlying cause of death. Both datasets were searched for mentions of obesity.
Researchers then used each dataset to calculate age-specific mortality rates in five-year age bands so that they could calculate an “age-standardised mortality rate” for each age group. This means that they standardised the death rates in the different datasets against a theoretical population that has the same age structure as England. In this way, death rates from the two datasets were made comparable with each other and the national situation.
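As a rough illustration of the standardisation step described above, the sketch below computes age-specific rates and then weights them by a standard population. It is not the researchers' code: the age bands, death counts and both populations are made-up numbers, and the function names are hypothetical.

```python
# Direct age standardisation: a minimal sketch with made-up numbers.
# Every count and population figure below is hypothetical.

def age_specific_rates(deaths, population):
    """Deaths per person in each age band (bands are matched by key)."""
    return {band: deaths[band] / population[band] for band in deaths}

def age_standardised_rate(rates, standard_population, per=100_000):
    """Weight each band's rate by the standard population's size in that band."""
    total = sum(standard_population.values())
    weighted = sum(rates[band] * standard_population[band] for band in rates)
    return per * weighted / total

# Hypothetical study population, split into five-year age bands.
deaths = {"40-44": 2, "45-49": 5, "50-54": 9}
population = {"40-44": 50_000, "45-49": 40_000, "50-54": 30_000}

# Hypothetical standard population with a different age structure.
standard = {"40-44": 35_000, "45-49": 35_000, "50-54": 30_000}

rates = age_specific_rates(deaths, population)
asr = age_standardised_rate(rates, standard)
print(f"Age-standardised rate: {asr:.3f} per 100,000")
```

Because both datasets would be weighted by the same standard population, differences in their age structure no longer distort the comparison.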
The researchers analysed the Oxford data within four time periods corresponding to changes in the regulations about recording deaths - 1979–83, 1984–92, 1993–2000 and 2001 onwards. This dataset was used to see whether changes in directives about coding led to any changes in the way obesity-related death was recorded. The national English dataset was used to assess whether there were any significant increases or decreases in deaths associated with obesity.
Changes in coding
From 1984, the rules governing the selection of the underlying cause of death changed - revision of the International Classification of Diseases (ICD) specified that certain diseases, which can be modes of dying rather than causes of death, should not be recorded as the underlying cause if another ‘primary’ condition is present. There were further changes in 1993, which saw the introduction of automatic coding software by the Office for National Statistics and the use of multiple-cause coding become standard practice in England.
What were the basic results?
Of 656,443 deaths recorded in the Oxford data, obesity was a certified cause of death in 1,002 cases (0.15%). Obesity was recorded as the underlying cause of death in 26% (259/1,002) of these.
The researchers then analysed deaths in relation to periods of different coding practice. The proportion of obesity-related deaths with obesity as an underlying cause was 22.2% in 1979–83, 36.4% in 1984–92, 25.8% in 1993–2000 and 17.4% in 2001–06. The researchers say that the increase between the first two periods and the decrease between periods two and three were both statistically significant and “coincided with coding rule changes”.
English national mortality data from 1995 to 2006 showed that obesity was a certified cause of death in 8,450 of 6,054,897 deaths (0.14%). It was recorded as the underlying cause of death in 24.8% of these. The percentage of all deaths in England with obesity on the certificate doubled from 0.11% in 1995 to 0.23% in 2006. The researchers estimated that this represented an average annual increase of 7.5% for men and 4.0% for women.
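The headline proportions quoted in this section can be checked with a few lines of arithmetic. The sketch below uses only the counts reported above; the helper name pct is ours, not the study's.

```python
# Recompute the proportions quoted in the results from the reported counts.

def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole

# Oxford data: 1,002 obesity mentions among 656,443 deaths.
oxford_share = pct(1_002, 656_443)
# Of those 1,002 mentions, 259 had obesity as the underlying cause.
oxford_underlying = pct(259, 1_002)
# National data: 8,450 obesity mentions among 6,054,897 deaths.
national_share = pct(8_450, 6_054_897)
# Ratio of the 2006 share (0.23%) to the 1995 share (0.11%).
growth_ratio = 0.23 / 0.11

print(f"Oxford: {oxford_share:.2f}% of deaths mention obesity")
print(f"Oxford: {oxford_underlying:.0f}% of those as underlying cause")
print(f"National: {national_share:.2f}% of deaths mention obesity")
print(f"1995 to 2006 ratio: {growth_ratio:.2f}")
```

The ratio comes out slightly above two, which matches the description of the share having doubled.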
How did the researchers interpret the results?
The researchers concluded that: “There is an emerging trend of increased certification of obesity as a cause of death in England.” They also say that relying on the underlying-cause mortality statistics alone “fails to capture the majority of obesity deaths”.
Conclusion
This study highlights important issues surrounding the complex nature of recording of the causes of death. In their work the researchers note that there is an increase in the recording of obesity as a cause of death, but that it is usually noted to be a contributing, rather than underlying, cause. The researchers also say that until recently in England, only one underlying cause of death from each death certificate is used for routine coding and analysis in national systems. There are problems with this approach, including missing data on contributing causes.
The overall increase in certification of obesity-related deaths also suggests that there is a better way to use these routine statistics to assess mortality - studies that assess mortality based only on obesity as a primary cause of death would have missed this increase. The researchers also make a sensible recommendation over using broader measures when monitoring obesity-related deaths in public health planning, saying that “public health practitioners should consider the importance of using all certified causes of death and not just the underlying cause”.
The researchers say that it seems likely that the increase in obesity-related deaths noted in their study is linked to increasing prevalence of obesity, but that there are other potential reasons for these changes. These include an increase in the prevalence of disease, an increase in severity of the disease (i.e. obesity at levels more likely to kill), increased clinical awareness and certification practice changes such as an increase in willingness to certify obesity.
Links To The Headlines
Obesity rise on death certificates, researchers say. BBC News, February 22 2010
Links To Science
Duncan M, Griffith M, Rutter H and Goldacre MJ. Certification of obesity as a cause of death in England 1979–2006. The European Journal of Public Health [Advance access published online] February 2 2010
-
3 FREE ways to analyse your website backlinks
[Windows] (MSDN Blogs)
This week I was pleased to run a side session at the MVP Summit on Search Engine Optimisation. One of the topics I covered was SEO tools, as part of which I looked at various tools which are available for analysing websites and extracting data to understand how the crawlers see individual pages or sections of the sites.
Whilst there are many different SEO analysis tools available which provide a variety of features and datasets, I personally favour those which enable me to extract raw data and then manage it in Excel, Access, SQL Server or another tool.
Within the Microsoft support SEO project we have found backlink data extremely useful to analyse. When I say ‘Backlink data’, I am referring to information showing which sites are linking to your site, and which pages on your site they are linking to. This data can be extremely valuable for identifying additional link building opportunities, finding problems with invalid URLs pointing to your site and discovering the most powerful pages/sections of your site in terms of PageRank. Since the only way to know which websites are linking to yours is to crawl the web (and it’s pretty big), backlink data can only be generated by search engines or companies who have their own web crawlers. The good news is that you can get a complete list of backlinks pointing to your website, for free. In fact, depending on where you get the data from, you can even get more data than just the links themselves…
Method 1: Bing webmaster tools
Once you are validated for your Bing webmaster tools account, you will be able to view and export a list of backlinks which Bing knows about pointing to your website…
Unlike the two other solutions below, Bing will not currently provide you with a list of individual pages on your site where the backlinks are pointing TO. However, Bing does provide a nice filtering feature which enables you to see only links coming from a particular domain, subdomain or directory. E.g. here is the list of backlinks filtered for support.microsoft.com coming from www.microsoft.com/uk…
Bing will allow you to export the results in to Excel or another tool, however it will currently only allow you to export 1000 results.
Method 2: Google Webmaster tools
Google webmaster tools also provide backlink data for your website. Google allows you to see exactly which of the pages on your site have the most links…
Google will also allow you to export the data, but does not limit you to 1000 backlinks, so you can download EVERY single link which is pointing to your site, and the URL which it is pointing to. If you have a big site, the file will be pretty large, so you may not be able to load it straight to Excel. We recently extracted this file for support.microsoft.com (as a comma delimited file) and then imported it in to Microsoft Access using the External data import option….
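If you would rather script the import than use Access, the sketch below streams a comma-delimited export into a database in batches, so the whole file never has to fit in memory. It swaps in SQLite (bundled with Python) for Access, and the column names source_url and target_url are assumptions for illustration; check the header row of your own export.

```python
import csv
import io
import sqlite3

def load_backlinks(csv_file, conn, batch_size=10_000):
    """Stream rows from a CSV file object into SQLite in batches."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS backlinks (source_url TEXT, target_url TEXT)"
    )
    batch = []
    # Column names are hypothetical; adjust them to match the real export.
    for row in csv.DictReader(csv_file):
        batch.append((row["source_url"], row["target_url"]))
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO backlinks VALUES (?, ?)", batch)
            batch.clear()
    if batch:  # flush whatever is left after the last full batch
        conn.executemany("INSERT INTO backlinks VALUES (?, ?)", batch)
    conn.commit()

def links_per_target(conn, limit=10):
    """Pages on your site ordered by how many backlinks point at them."""
    return conn.execute(
        "SELECT target_url, COUNT(*) AS n FROM backlinks "
        "GROUP BY target_url ORDER BY n DESC, target_url LIMIT ?",
        (limit,),
    ).fetchall()

# A tiny in-memory stand-in for the exported file (made-up URLs).
sample = io.StringIO(
    "source_url,target_url\n"
    "http://example.org/a,http://example.com/page1\n"
    "http://example.net/b,http://example.com/page1\n"
    "http://example.org/c,http://example.com/page2\n"
)
conn = sqlite3.connect(":memory:")
load_backlinks(sample, conn, batch_size=2)
top_targets = links_per_target(conn)
print(top_targets)
```

For a real export, pass an opened file instead of the StringIO sample, for example open("backlinks.csv", newline="", encoding="utf-8"), where the file name is again a placeholder.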
The Google data also contains links from within the same domain as the website. We recently used this information to analyse the links pointing to http://support.microsoft.com from local (non-English) www.microsoft.com pages, so that we could optimise the links to point to local content, and increase the search relevancy for our international customers.
You get a lot of data when exporting from Google, but it can be very interesting and useful to analyse. It’s also worth considering comparisons between the Bing and Google data, to use as an indicator of differences in the information the two search engines may have about your site.
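One simple way to make that comparison is to reduce each export to the set of linking host names and look at the differences. This is only a sketch: the URLs are made up, and it assumes you have already read the linking-page URLs out of each export into a list.

```python
from urllib.parse import urlsplit

def linking_hosts(urls):
    """Return the set of host names found in a list of linking-page URLs."""
    hosts = set()
    for url in urls:
        host = urlsplit(url).hostname  # None when the URL has no host part
        if host:
            hosts.add(host)
    return hosts

# Made-up linking pages standing in for the two exports.
bing_sources = [
    "http://example.org/a",
    "http://example.net/b",
    "http://blog.example.org/post",
]
google_sources = [
    "http://example.org/x",
    "http://example.edu/c",
]

bing_hosts = linking_hosts(bing_sources)
google_hosts = linking_hosts(google_sources)

only_in_bing = sorted(bing_hosts - google_hosts)
only_in_google = sorted(google_hosts - bing_hosts)
in_both = sorted(bing_hosts & google_hosts)

print("Only in Bing:", only_in_bing)
print("Only in Google:", only_in_google)
print("In both:", in_both)
```

Comparing at host level rather than URL level keeps the result meaningful even though the Bing export is capped at 1000 rows.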
Method 3: Majesticseo.com
Majesticseo.com are a company who have their own web crawler (as Bing and Google do), their own web index built from all of the pages they regularly crawl (as Bing and Google do), but the difference is that Majesticseo.com provide the ability to analyse and extract the data within their web index for SEO analysis.
Whilst they do provide paid for services if you are interested in analysing data for other (i.e. competitor) sites, they provide FREE access to data about your own site if you validate yourself as an owner.
They do provide some web based analysis tools, although in my opinion, the real power behind the Majestic SEO data comes when you export the file and pull it in to your favourite database application. We now regularly extract this information, and load it in to an SQL server for analysis.
Like Google, Majestic SEO allows you to export ALL data for your site, however they go a step further by providing an ‘ACRank’ value, which is their version of PageRank and provides an indicator of the ‘strength’ of every page on your site in terms of the number, diversity and quality of inbound links pointing to it. The Majestic SEO ranking value is based on a scale of 0 to 15, rather than 0-10 like Google’s PageRank.
We are currently using the Majestic SEO data to identify top ranked pages on support.microsoft.com, and we have had a couple of surprises! For example, this page is one of the highest ranked pages…
http://support.microsoft.com/gp/howtoscript
…which is simply a page we use to notify customers how to enable scripting if they have it disabled. The reason this is ranking so highly is because we link to it by default on most of our pages if customers have scripting disabled, but also customers have discovered the page and decided to link to it from their own sites to instruct their users how to enable scripting.
This data has led to many more useful insights for us. If you are interested in knowing which sections/pages of your site are getting the most link juice flowing in to them, I really recommend downloading your Majestic SEO data.
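As a starting point for that kind of analysis, the sketch below ranks pages by ACRank from a CSV export. The column names page_url and ac_rank are assumptions rather than the real export headers, and the rows are made up; only the 0 to 15 scale comes from the description above.

```python
import csv
import io

def top_pages(csv_file, n=3):
    """Return the n pages with the highest ACRank as (url, rank) tuples."""
    rows = []
    # Column names are hypothetical; adjust them to match the real export.
    for row in csv.DictReader(csv_file):
        rank = int(row["ac_rank"])
        if not 0 <= rank <= 15:  # ACRank is described as a 0 to 15 scale
            raise ValueError(f"ACRank out of range for {row['page_url']}: {rank}")
        rows.append((row["page_url"], rank))
    # Highest rank first; ties are broken alphabetically by URL.
    return sorted(rows, key=lambda r: (-r[1], r[0]))[:n]

# Made-up rows standing in for an exported file.
sample = io.StringIO(
    "page_url,ac_rank\n"
    "http://example.com/help,4\n"
    "http://example.com/enable-scripting,9\n"
    "http://example.com/,7\n"
    "http://example.com/contact,4\n"
)
ranked = top_pages(sample, n=3)
for url, rank in ranked:
    print(rank, url)
```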
So there you have it, three ways of understanding what the search engine crawlers know about your site. Enjoy :-) Let me know if you come up with any clever ways of using this data – I would love to do a follow up blog post in future!
Author: Chris Moore is a Program Manager working on Search Engine Optimisation at Microsoft. http://www.twitter.com/chrismdotcom
-
Delivering software to support the cloud
[Corporate Blogs] (Blogs.oracle.com Recent Posts (English-language only))
From a software perspective, developing a cloud strategy is all about the data and not moving it. For a long time now Oracle has advocated the basic principle of doing everything inside the database. When you move to the cloud this makes even more sense because you do not want to be continually unloading, moving and reloading data into different engines.
Many of the data warehouse vendors actively promote the use of multiple processing engines to support their data warehouse solution. As a result you get something like this:
Unfortunately, this is exactly the approach being put forward by the same data warehouse vendors in an attempt to get customers to move to the cloud. Their view of an EDW in the cloud looks like this:
In this scenario data is being loaded, unloaded, moved and reloaded multiple times which increases latency, allows errors to be introduced and makes it difficult to determine exactly where a piece of data actually came from. There is also the topic of data security to consider - and that is a big topic! All this data movement, unloading and reloading provides numerous opportunities for security breaches.
If you are going to develop a viable data warehouse cloud strategy then what is needed are some simple rules that can be used to check the suitability of your preferred database platform:
- Flexible data model not fixed data model
- Data loading based on ELT not ETL
- Analytics inside the database not outside
- End-to-end security not disconnected security
- Effective resource management not ineffective resource management
Data Model Strategy -> Flexible not fixed data model
Oracle Database is not restricted to a single type of data model. This provides the required flexibility to provide a data model that can support real-time data loading as well as the complex analytics needed to support today's BI queries. Most importantly, as the business changes (new companies acquired, new products added and old products decommissioned) it is important to have a data model that can easily move with the business and not hold it back. This is especially true when considering a cloud-based strategy for the data warehouse since one of the main drivers of moving to a cloud based environment is "increased flexibility".
Oracle has developed and proven its reference architecture through numerous customer engagements over the last 14 years. During this time the model has evolved to build on the changes in the capability of the underlying database technology and tools. Each new release of the Oracle Database adds new data warehousing, security and availability features that make it significantly quicker and easier to implement this reference architecture.
The goal of Oracle’s Data Warehouse Reference Architecture is to deliver a high quality integrated system and information at a significantly reduced cost over the longer term. It does this through recognizing the differing needs for Data Management and Information Access that must both be delivered by the Warehouse, applying different types of data modeling to each in a layered and abstracted approach.
The Reference Architecture is intended as a guide and not an instruction manual. Each layer in the architecture has a role in delivering the analytical platform required to support next generation business execution. The Reference Architecture gives us something to measure back against so we can understand what we compromise by making specific architectural, technical and tools choices. It works equally well for new Data Warehouse developments as it does for developing a roadmap to migrate an existing one.
Below is an overview of the main elements of the reference architecture:
Data loading strategy -> ELT not ETL
Many data integration (DI) tools rely on their own engines to perform data transformations because many databases have very weak data transformation engines. Therefore, most DI tools extract data from a source system, move that data into their own processing engine and perform transformations in a row-based manner. Finally, the data is pushed into the target - the data warehouse. The situation gets more complex when you start including processes to manage data quality, data lineage, data discovery, etc.
This approach means customers have to manage multiple servers, and their network takes a beating every time the ETL jobs are run because large data sets are moved around the network, being passed from engine to engine. Yet the need to use multiple engines with associated dedicated hardware is often cited as an excellent reason for moving to a cloud based strategy. This removes the need to manage all those servers and software licenses. Data can freely move around the cloud taking advantage of the latest versions of each piece of software.
Yet it is the volume of data and the complexity of the transformations (ETL and data quality) that makes it vital that processing is done within the data warehouse database engine under the control of the database workload management features. The Oracle Database has specialized and optimized data transformation features such as set-based operations, error logging, pipeline table functions and regular expressions.
The Oracle Database license includes the Warehouse Builder (OWB) which follows the approach of extracting the data from the source system, loading into the target (DW) and then applying the required data transformations using the power of the Oracle Database. Customers using OWB do not need to buy additional hardware to run their ETL or additional tools beyond their normal enterprise database license. Therefore, using OWB it is possible to "cloud-enable" the ETL process directly within the database.
As all ETL jobs are under the control of the database workload manager the priority and access to resources can be managed from one central console. Using OWB's macro language ("experts") it is possible to write wrappers around normal processes that users might want to do such as load the contents of an Excel worksheet into a table. This way, users can "build" and execute their own ETL jobs using the same ETL tools and repository as the IT team. Then when something needs to be changed the impact on the whole environment can easily be determined.
Processing data inside the database makes sense. Take the analysis to the data not the other way round!
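To make the load-then-transform idea concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for the warehouse database; it is not Oracle or OWB, and the table names, column names and the function load_then_transform are invented for illustration. The point is only that the raw rows are loaded first and the transformation is a single set-based statement executed by the database engine.

```python
import sqlite3

def load_then_transform(rows):
    """ELT sketch: load raw rows into staging, then transform inside the database."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical staging and target tables.
    conn.execute("CREATE TABLE staging_sales (region TEXT, amount_cents INTEGER)")
    conn.execute(
        "CREATE TABLE sales_by_region (region TEXT PRIMARY KEY, total_dollars REAL)"
    )
    # Extract + Load: the raw data goes straight into the staging table.
    conn.executemany("INSERT INTO staging_sales VALUES (?, ?)", rows)
    # Transform: one set-based statement cleans and aggregates the data,
    # instead of a row-by-row loop in an external engine.
    conn.execute(
        """
        INSERT INTO sales_by_region (region, total_dollars)
        SELECT UPPER(TRIM(region)), SUM(amount_cents) / 100.0
        FROM staging_sales
        GROUP BY UPPER(TRIM(region))
        """
    )
    result = conn.execute(
        "SELECT region, total_dollars FROM sales_by_region ORDER BY region"
    ).fetchall()
    conn.close()
    return result
```

Calling load_then_transform with a few messy (region, amount_cents) tuples returns one cleaned, aggregated row per region, and the data never leaves the database between the load and the transform.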
Analytics Strategy -> inside the database not outside
As with ETL it makes sense to do as much processing inside the database as possible since this is where all the data and real processing power is located. Personally, I think the challenge for most Oracle customers is knowing what is inside the database. The latest version of Enterprise Edition offers data mining, OLAP/multi-dimensional models, spatial, text mining, and support for unstructured data. By keeping all these types of analysis within the database engine it is possible to run cross-functional analysis that is simply not possible in other data warehouse databases/engines.
Imagine being able to analyze the result from a data mining model using spatial analytics and then applying a top 10 and bottom 10 query to highlight winners and losers? Could you do this using the cloud? Of course, but it would probably mean unloading data from the enterprise data warehouse into a data mining engine and then pushing the results to a spatial engine and creating a federated query across the spatial and data warehouse datasets to run the winners and losers query. That all takes time and time is what most business users do not have - even without considering who is going to write the ETL to move all that data around!
End-to-end security not disconnected security
One of the biggest challenges around cloud computing is data security. Why? Because data is continually on the move from one engine to the next and that movement is not consistently encrypted: some engines have an encryption process (usually unique to them) and others have nothing. How do you know who is accessing your most sensitive data and, more importantly, how do you know where it is being moved to?
There is an easy answer to this: don't put sensitive data in the cloud! The only problem is that the sensitive data is usually the gateway to a lot of very important analysis. Therefore, you either stop moving data around, or you apply strong encryption and authorization policies or you do both. Fortunately, Oracle offers both! Using Oracle Database as the foundation of a DW cloud strategy means you can use Oracle's transparent security features to lock down sensitive data and stop unauthorized access. Data remains locked inside the Oracle Database where you can use the built-in analytic power to run queries across effectively secured data sets.
Effective resource management not ineffective resource management
If you are going to manage resources with the cloud in an effective way then you need to be able to control all aspects of the data warehouse workload. Most database systems, including those with cloud platforms, will provide some degree of control over the processing directly within the database.
The Oracle Database Resource Manager (DBRM) allows the DBA to prioritize workloads and restrict access to resources for certain groups of users. This allows the DBA to protect high priority users or jobs from being impacted by lower priority work. The DBRM does this by allocating CPU time to different jobs based on their priority. The amount of resources allocated to a specific workload or user can depend on the percentage of CPU time, number of active sessions, amount of space available, etc.
The addition of Exadata to the data warehouse platform provides the Oracle DBA with one significant advantage for managing workloads: it extends DBRM's capabilities to include the coordination and prioritization of I/O bandwidth consumed between databases, and between different users and classes of work. This is only possible with Oracle and is the direct result of the tight integration between the database with the storage layer. Exadata is aware of what types of work and how much I/O bandwidth is consumed. Users can therefore have the Exadata system identify various types of workloads, assign priority to these workloads, and ensure the most critical workloads get priority.
To support a data warehouse cloud strategy that covers data warehousing and/or mixed workload environments, you may want to ensure different users and tasks within a database are allocated the correct relative amount of I/O resources. For example you may want to allocate 70% of I/O resources to interactive users on the system and 30% of I/O resources to batch reporting jobs. This is simply not possible, or at best extremely complex to achieve, with other vendors' databases. With Oracle, this is simple to set up and enforce using the DBRM and I/O resource management capabilities of Exadata storage.
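The 70/30 example can be illustrated with a small sketch of proportional-share allocation. This is not the DBRM or Exadata API and says nothing about how Oracle implements it; the function allocate_io, the group names and the redistribution rule (capacity a group does not need is offered to the others in proportion to their shares) are assumptions made for illustration. Shares are assumed to be positive numbers.

```python
def allocate_io(total_iops, shares, demand):
    """Split total_iops by relative share, capped by each group's demand.

    Assumption: capacity a group does not need is re-offered to the
    remaining groups in proportion to their shares.
    """
    allocation = {group: 0.0 for group in shares}
    remaining = float(total_iops)
    active = {g for g in shares if demand.get(g, 0) > 0}
    while remaining > 1e-9 and active:
        weight = sum(shares[g] for g in active)
        satisfied = set()
        handed_out = 0.0
        for g in active:
            offer = remaining * shares[g] / weight  # this round's proportional offer
            grant = min(offer, demand[g] - allocation[g])
            allocation[g] += grant
            handed_out += grant
            if allocation[g] >= demand[g] - 1e-9:
                satisfied.add(g)
        remaining -= handed_out
        if not satisfied:
            break  # every active group took its full offer; nothing left to re-offer
        active -= satisfied
    return allocation
```

With shares of 70 for interactive users and 30 for batch jobs and both groups saturating a 1000 IOPS budget, the split is 700/300. If interactive users only need 200 IOPS, the sketch hands the spare capacity to batch, which ends up with 800.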
Summary
With Oracle Database 11g you get an integrated and complete software platform to support a cloud strategy
In this model Oracle Database provides an intelligent cloud, or iCloud, compared to the more traditional "dumb" cloud which is being heavily promoted by many of the current data warehouse vendors as they rush to prove their cloud credentials. Oracle offers "iCloud" as the way forward for your data warehouse strategy, which really is the only way forward.
-
Bioinformatics (Seattle)
[Jobs] (craigslist | all jobs in seattle-tacoma)
Bioinformatician
Integrated Diagnostics, a molecular diagnostics company, seeks to add a Bioinformatician to join our growing team. The Bioinformatician is a member of the scientific team responsible for selecting biomarker candidates on targeted diseases, analyzing proprietary and public datasets derived from genomic, transcriptomic, and proteomic analyses, and providing bioinformatics support for all company projects. The position reports to the Director of Bioinformatics.
The mission of Integrated Diagnostics is to leverage powerful emerging technologies in the development of diagnostic products that enable physicians and patients to manage complex and important diseases such as cancer, lung disease and CNS diseases through blood tests that can monitor tens to hundreds of disease markers simultaneously. Integrated Diagnostics was founded by Lee Hood of the Institute for Systems Biology.
Responsibilities include but are not limited to:
Developing software to automate data analysis and data mining.
Analyzing large-scale genomics, transcriptomics, and proteomics datasets to support the identification and the validation of biomarker candidates.
Developing software to facilitate the mining of public and private knowledge databases such as PubMed, DAVID, Ingenuity, cancer genome atlas, Gene Ontology, OMIM, iHOP, EntrezGene, etc.
Collecting and managing information from public data repositories such as PeptideAtlas, GEO, caBIG, etc.
Performing pathway/network analysis on genes or proteins of interest.
Developing informatics to store crucial information on biomarker candidates and build user-friendly interfaces for others to mine the data.
Documenting and reporting progress in timely fashion.
Exploring and learning new bioinformatics tools developed by others.
Qualifications:
A MS or Ph.D. degree in bioinformatics, computer science, mathematics, physics, statistics, or engineering is required.
A minimum of 3 years of industrial work experience participating within a team oriented environment.
Strong programming skills and proficiency with script languages (such as PERL) and computing languages (C, C++, or Java) are essential.
Experience with Excel, pathway/network analysis, data management, and data mining on public databases is required.
Record of developing new bioinformatics tools is preferred.
Hands-on experience in analyzing large-scale genomic, transcriptomic and/or proteomic datasets is desirable.
Experience in analyzing large-scale mass spectrometry datasets is a huge advantage.
Working knowledge of relational databases, R/Bioconductor, and Matlab is an asset.
Ability to work under both Windows and Linux/Unix systems is a plus.
The ideal candidate would be collaborative, reliable, creative, curious, a team player, and a self-motivator. He/she must have good interpersonal and communication skills; be adaptable to changing work requirements, willing to multi-task, and eager to take on new challenges; and hold a high degree of professional integrity.
For consideration, please email a resume to hr@integrated-diagnostics.com.
Integrated Diagnostics is an EOE and offers a competitive salary. For more information on Integrated Diagnostics, please visit our website http://www.integrated-diagnostics.com.
-
VMware Partner Exchange 2010 from where I sat
[Corporate Blogs] (EMC BlogRoll)
Phew – after 5 days in Vegas, you get pretty cooked. At home now finally and love seeing my family. Well – a quick little summary of the week, and what we announced, showed, and discussed.
Was a GREAT VMware Partner Exchange (PEX) – thank you VMware!
So – what did we see and do?
- VMware’s continued growth is one of the largest drivers for partner growth. PEX attendance was up 77% over last year.
- VMware is super-focused on the partner community. This came through loud and clear in Carl Eschenbach’s keynote.
- We saw this too in the EMC bootcamp. We held a bootcamp on the Monday of the event for the EMC partners present. There were almost 200 people there all day long. Thank you EMC partners! Would love your feedback on the event.
- The VCE roundtable was packed – with 202 partners. Got great questions, and great feedback. Cisco’s posting the video shortly.
- The VCE reception was also fantastic, thanks everyone for attending.
- My team: there were a bunch of new members of the vSpecialist squad – was great to hang out with you!
- The EMC booth: was cool that we pulled together the Vblock 0 prototype to show at the show, and got positive feedback on the demos. Good team on the schwag too – those little wind-up flashlights will help deal when the inevitable apocalypse arrives :-) Oh, my kids will like them too :-)
- TAP: Got solid feedback on the vSphere roadmap and Storage Ecosystem roadmap sessions from some of the team newbies (of course, the vets all were in the loop on this already)
- VCDX defense for some of the folks on the team. Scott, good luck, I’m confident in you!
- Steve Herrod did some big unveiling. Project “Redwood” (end user self-service portal targeted for Private and Public cloud uses) was publicly outed for the first time, as well as the next version of VMware View (loads of stuff in here, more to come soon). He also talked about the scaling and feature goals of the next generation of vSphere. Even with all the legalese and caveats at the front of the session, it’s very exciting stuff. My team and I are lucky to have insider front-row seats to everything that’s been going on – so none of this was a surprise to us, but it’s great that it’s getting out there.
- Fun stuff:
- Parties: the tailgate party Cisco and EMC sponsored was loads of fun! Great exciting Superbowl game. I have to say that personally, I think The Who can still rock.
- Playing craps with friends – I hit 4 straight yos when it counted, and the money was raining in. Also on the last night with some new friends where I was sucking it, but man, they were rolling like nobody’s business….
- The big party was fantastic – at the House of Blues, with a great 80’s band – the English Beat
- The VMware EBC on wheels – just awesome…. Pictures on that one below.
- The VMware/EMC/NetApp alliance team dinner :-) Pictures and the story on that one below!
In the EMC Bootcamp we spent the day on helping partners get more out of their business, and making sure they had all the latest tools we provide to help them help their customers. I did the keynote which was “VCE: What’s going on behind the scenes and how can you make the most of it”. This frankly discussed where we are (including gaps) since the VCE launch, as well as providing a technical preview into the integrated 3-company roadmap for the next year.
I also showed the new Celerra NFS datastore VMware capabilities, previewed the next version of the EMC Storage Viewer, and the next version of Unified Infrastructure Manager (UIM v2). I also gave some big hints on the stuff that’s coming at EMC World – each of which will be huge….
If you’re interested in more detail, including demonstrations and screenshots of these new and “arriving so shortly it’s essentially now” functions that we talked about, read on….
The Celerra compression and deduplication engine has been updated and now handles the VMware use case.
Dedupe and compression are both variants of data-reduction technologies. I don't want to be TOO much of a nerd here, but the point is valid.
I want to explain this quickly, and I think the commonality and difference between compression and deduplication needs to be stated. Why? Now the full on “whose production data reduction/dedupe/compression” fight is inevitably going to start up. Until now, except customers putting NFS datastores on Data Domain (not generally a good idea as that’s not what Data Domain is designed for – it is designed to be a killer dedupe backup target), NetApp seemed to be the only vendor in the market with a production capacity efficiency benefit on top of thin provisioning for production datastores.
Data reduction techniques have varying effectiveness/cost (and here cost means "CPU cycles, processing time, performance impact", ergo not $$ but "engineering costs") depending on the dataset. A trivial example:
- filesystem containing ten files. Four files are EXACTLY the same.
- filesystem containing ten files. Files are similar, but not the same.
- file-level dedupe is extremely efficient in the first example (low impact, high capacity efficiency gain).
- compression is moderately efficient in the second example (low impact, moderate capacity efficiency gain)
- block-level dedupe is more capacity efficient in the second example (generally higher “cost”, high capacity efficiency gain)
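The trivial example above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Celerra F-RDE or any other array implements data reduction: files are in-memory byte strings, blocks are fixed-size 4 KB chunks, and the function names are invented for illustration. Only capacity is modelled; the CPU "cost" side is not.

```python
import hashlib
import zlib

def file_level_dedupe(files):
    """Bytes stored if only one copy of each distinct file is kept."""
    unique = {hashlib.sha256(data).hexdigest(): len(data) for data in files}
    return sum(unique.values())

def block_level_dedupe(files, block_size=4096):
    """Bytes stored if only one copy of each distinct fixed-size block is kept."""
    unique = {}
    for data in files:
        for start in range(0, len(data), block_size):
            block = data[start:start + block_size]
            unique[hashlib.sha256(block).hexdigest()] = len(block)
    return sum(unique.values())

def compressed_size(files):
    """Total bytes if each file is compressed independently with zlib."""
    return sum(len(zlib.compress(data)) for data in files)

# First filesystem: ten 8 KB files, four of them exactly the same.
identical_set = [b"A" * 8192] * 4 + [bytes([i]) * 8192 for i in range(1, 7)]

# Second filesystem: ten 8 KB files that share a common 4 KB header
# but each have a different 4 KB tail (similar, but not the same).
similar_set = [b"H" * 4096 + bytes([i]) * 4096 for i in range(10)]
```

On the first set, file-level dedupe keeps seven of the ten files. On the second set it saves nothing because no two files are identical, while block-level dedupe stores the shared header only once. The toy data is highly repetitive, so the zlib figures it produces are not representative of real VMDK or NAS content.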
Celerra F-RDEv2 (the nerdy engineering name - "File Redundant Data Elimination") is accurately characterized as dedupe and compression. It finds and deduplicates files at the file object level (which is the most efficient, and largest immediate savings for general purpose NAS), and compresses within files. F-RDEv1 skipped files >200MB. This meant that the original release (now out for about 1 year) had little efficiency effect on the “VMware NFS datastore use case” where the bulk of the capacity is in large multi-GB VMDK files.
Of course, the broad use of our NAS devices tends to be dominated by basic unstructured NAS, and we’ve been delivering massive efficiency gains there for our customers for a year now, and that use case was our original design focus.
The march of storage efficiency technologies continues for EMC as it does in the industry as a whole…. F-RDEv2 (which is GA, and has been and will continue to be free) now has no file-size restriction. This means that on top of Thin Provisioning, it provides about an additional 40-50% capacity savings gain when applied directly to the VMware on NFS use case. Testing has shown that it has no material effect on write performance (even helping in corner cases) and about a 10% impact on read performance. Don’t think of what it’s doing as a “zip”; it’s leveraging core Recoverpoint technology that’s used for real-time compress/decompress.
On a side note, one thing that has been fascinating to watch inside the company has been the acceleration of innovation and integration across the parts of the company over the last 2 years. They’ve moved to this approach called “consumer/provider” where the roadmap has various teams providing deliverables for some, and consuming those of others. This is most visible within the recently renamed Unified Storage Division (that has CLARiiON, Celerra, Centera, Recoverpoint). For example, the iSCSI stack from CLARiiON and Celerra have merged. The block virtualization layer (performs critical elements for things like Thin Provisioning and other cool things to come) in CLARiiON and Celerra are actually the same codebase (CBFS) now. Another example is Avamar and Recoverpoint IP being embedded in the NAS code. Anyone who knows engineering of this scale knows that it takes about 2 years minimally for changes to show up. Trust me, there are massive cool payoffs in store here based on the work of the last two years. Oh – EMC World is so close :-)
Another nice thing about the engineering approach used by F-RDEv2 is that it imposes no reduction on filesystem size, is unaffected by any other Celerra feature (snapshots, etc.), and has certain other nice properties, like being able to target datastore or VM-level objects (and many other things).
When will you be able to get this? Early March, and it is FREE!
This continued march of storage efficiency (in both capacity, power and flexibility dimensions) will not stop…..
Virtual-Machine Level array-based snapshots and clones and dramatically simpler provisioning
The same release that expanded the application of redundant data elimination to customers using Celerras also added the ability to snapshot and clone individual files within a filesystem (in fact also clone a file ACROSS filesystems). While inevitably there are areas where each vendor does something before the other, this is one where NetApp got the ball rolling with their Rapid Clone Utility (RCU) and ONTAP 7.2.3 and later. While I don’t claim to be an expert on NetApp, these seem to be very analogous (as always, there are certain things EMC does that they don’t, and vice versa); interested customers should compare both and evaluate for themselves.
“There’s an App for that” in VMware-land = “there is also a vCenter plugin for that”
So, as the array got this “VM level” operation we extended the vSphere client to make using it simple and easy. It also makes provisioning NFS datastores a lot easier (automatically configuring all the Celerra and ESX host properties), scaling easier (does it across the entire vSphere cluster in a single operation), and also makes expanding datastores a snap.
This also adds the compression/dedupe control directly in vCenter – as well as the ability to quickly and easily see the capacity savings.
BTW I don’t want to over-sell this function. I personally think that in the client virtualization (View) use case, customer pain is more about client image management and composition, not storage capacity. The problem of client image management and composition can actually be solved much better at the vSphere layer. People who have seen View 4 know this (and partners at PEX got a preview of the next rev, which takes it even further).
Don’t get me wrong, being more efficient is good, but our guidance for customers will not be to use this new function to eliminate the stuff that View Composer/Thinapp can be used for. Also, in the View use cases, the most important thing you can do to reduce storage requirements is drive down the per-guest IO workload (follow VMware’s VDI best practices!)
The VM hardware-accelerated snap/clone is more useful in many general VMware virtual machine use cases. BTW - this same idea (hardware-accelerated ESX VM-level snapshot/clone) is coming to block datastores in the vStorage APIs for Array integration, which EMC will be completely supporting on our current-generation array block targets.
When will you be able to get this? Early March, and it is FREE!
This continued march will not stop….
Here’s a demo of these new EMC Celerra NFS functions… Those of you wondering if you can try it with the Celerra VSA, let me do one more update before you spend the time to make it work (it works with the CMR-11 Celerra VSA that is posted, but is simpler and easier with the one targeted for March).
You can download the high-resolution WMV version here, and the MOV version here.
Previewed the next version of the EMC Storage Viewer vCenter plugin
Customer feedback on the EMC Storage Viewer vCenter plugin has been very positive, and we’re investing even more resources into it now.
The feedback that we got was that:
- Make configuring Solutions Enabler (a pre-requisite) easier. Note that Solutions Enabler is now available as a Virtual Appliance (like an ever expanding set of EMC products) here.
- Add performance data (coming in this next release)
- Make features/functions more consistent across both NAS and Block use cases (coming in this next release)
- Give the VMware administrator provisioning control, but only if their “portion of the storage array can be carved out”. Since most arrays (outside the storage used in a Vblock) are used for multiple uses at the same time, there’s reasonable concern that one action for a given use could unpredictably impact other uses. The way we are implementing this, the storage team can assign virtual storage pools to the VMware team, who can then provision themselves directly within the vSphere client. A screenshot is below.
Previewed the next version of the EMC Ionix Unified Infrastructure Manager (UIM) v2
BTW – My comments here (on Vblocks/portal in the “cloud compute” use case) and here (on multitenancy in cloud compute cases) make more sense now that the idea of Redwood is out there. So let me explain it a bit further.
Basically Redwood takes the vSphere/vCenter layer which provides big aggregated pools of CPU/Memory/Network/Storage and hides all complexity and puts a multitenant front-end that enables simple end user self-provisioning.
But, the pools themselves are assumed static. In the example Steve used during the demo, the end user provisions a VM and starts using it. The VMware administrator sees that the datastore that Redwood selected was getting full, and then uses storage vMotion to non-disruptively move the virtual disks to another less utilized datastore.
What we’ve been working on with Vblock and UIM is to extend the idea of automated infrastructure right down to the infrastructure stack itself that supports vSphere. This would mean that the datastore could automatically expand in the example use case. Or if more or less compute was needed OVERALL within the vSphere cluster, that could be added/removed as needed.
Put another way:
Our design goal is to make it so that if vCenter and Redwood need the vSphere cluster to get 4 new hosts, they appear. Need more datastores? No problem, added – automatically. Using less and want to put the hardware back into a pool for other clusters? No problem, vacate, maintenance mode, remove cluster node, and release unused storage and networking elements – all automatically.
In discussions with service providers and enterprise customers trying to stand up these cloud compute services (IaaS for public or internal use cases), the biggest challenge has been the construction and maintenance of the “end user portal”. Redwood simplifies and automates the layer between the end-user and the vSphere layer - Redwood is focused on making that easier for the enterprise/service provider.
The second most difficult technology challenge at the infrastructure layer was trying to link into a bunch of disparate APIs for whatever combination of gear they pick to simplify and automate infrastructure provisioning at scale. UIM and Vblock make the actual infrastructure itself elastic, and elastic in an automated way - UIM is focused on making that easier for the enterprise/service provider.
This highlights the point we’re trying to make with Vblock. It represents a single “product”, not a combination of products (the ideas of the VCE SST and the VCE support model flow from that – a “product” is not sold and supported by 3 companies, but rather by ONE company – these work efforts are hard, much harder than creating the reference architecture). So what is the product? It is a VM-housing black box. We can then optimize to a MUCH higher degree. And while there are others who try to be a “manager of managers” – trying to cover the infinite set of permutations of servers/network/storage/OS is nearly impossible to nail. UIM is not analogous to BMC Bladelogic. It’s analogous to an “Element Manager”, where the element is a Vblock – whose constituent elements are known and very defined.
It’s then possible to create a product that can not only manage multiple Vblocks from a single console, but also apply multitenancy models to the infrastructure if needed (though now that everyone has seen Redwood, my earlier point is clear, as is what Chuck has been saying here - customer/customer multitenancy comes from the end-user portal). Enforcement of multitenancy through the rest of the stack doesn’t solve the “who watches the watcher” problem (that’s the translation of the Latin quote Jonathan from NetApp uses in his comment in the thread). That leaves the question: “what stops the service provider from being able to get at my ‘stuff’?”
Actually, if you go back and watch the tapes, Steve Herrod actually mentioned something coming to solve that too – he mentioned “watch closely to see things new about securing the cloud and demonstrating that capability and compliance posture to the end-customer”. It won’t be long… wait for it… wait for it…
Pictures of some of the fun..
Ben Matheson (VMware), Ed Bugnion (Cisco) and I opening up the tailgate party on Sunday…
So – what’s the story on the VMware/EMC/NetApp alliance dinner?
It’s no secret that I have a lot of respect for NetApp, both as a company and their technology. I also like having a competitor that is investing in VMware’s success in the market, and alongside EMC’s storage efforts (putting aside the fact that we also invest down the RSA and Ionix tracks). Good competition pushes us both continuously.
When it comes to VMware-focus and integration, in my eyes (which may be off-base), it’s basically EMC and NetApp and then the others are so far back that it doesn’t matter. That doesn’t mean they don’t have VMware integration, and that they aren’t fine, fine products, but they just don’t invest as much in this specific space.
Look, it’s not just engineering/product, but look at PEX for partner coverage. Ditto at VMworlds past and future. Whether it’s sessions, PR, marketing, customer stories. Heck even the pure passion that comes out in some of the discussions. Sometimes it drives me crazy when they do something that to me seems “over the line”, but hey I’m sure they think the same about us – and at least it’s never boring :-)
So, we’re having the EMC/VMware alliance teams dinner and I found out that the NetApp/VMware alliance teams were in the same restaurant. I wanted to propose we all just park our badges at the door and have dinner together, but wiser minds on my team suggested that it would be rude to cut into their dinner. But, I wanted to do something, so I sent over a couple bottles of champagne with a note. I was worried it might come across “dick-ish”, but apparently not (which is good – was not my intent). Then, Jim Sangster from NetApp kindly came over to the EMC table, then I came over to the NetApp table, and we shared a toast.
Was good to meet Jim Sangster (Sr. Director of the VMware Alliance) and Mitchell Ratner (VMware Global Alliance Manager) in the photo at left and share a toast in the photo on the right – and a great week overall.
To all the VMware and EMC partners at the event – thank you, and it was great to hang out, learn and have fun!
-
Web bioinformatics developer (South Lake Union, Seattle)
[Jobs] (craigslist | all jobs in seattle-tacoma)
About LabKey Software
LabKey Software works in partnership with leading scientists to create powerful applications for managing biomedical research data. We build solutions tailored to our clients' needs by leveraging the LabKey Server open source platform, a secure, scalable, web-based foundation for collaborative research. Our client projects both benefit from and contribute to the open source community, building an ever-more capable platform available to everyone.
We provide competitive pay and benefits and flexible work schedules. The environment at LabKey is based on open communication and solving customer needs through a combination of individual ownership and group refinement of ideas. Our offices are located on the main campus of the Fred Hutchinson Cancer Research Center in South Lake Union, Seattle. LabKey is an equal opportunity employer.
For more information about the company, please visit www.labkey.com. For more information about our software, visit www.labkey.org.
Summary
This is a full-time, entry-level opportunity for a hardworking web developer interested in tailoring installations of the LabKey Server open-source platform to help individual scientists integrate, analyze, and share large, complex datasets. You will build customized solutions that satisfy the specific needs of customers, accelerating their work in fields like cancer and HIV research.
You will work with internal and external customers to gather requirements and implement them using client- and server-side scripting languages. You will create web-based user interfaces, reports, analysis tools, quality control verification scripts, and more. You will also develop prototypes for sales situations and help respond to support questions from end users and external developers. You will participate in all parts of the product development cycle.
This position has strong career growth possibilities.
Qualifications
- Track record of solving bioinformatics problems
- Understanding of general software development principles
- Experience with web page development (static and AJAX) and other scripting languages
- Working knowledge of relational databases and SQL
- Comfortable working directly with customers to understand use cases, record requirements, and provide technical support
- Excellent communication and writing skills
- Authorized to work in the US - we are unable to sponsor work visas
Experience and Education
Bachelor's degree in bioinformatics, life sciences, or computer science with at least 2 years of software development experience
Relevant Tools and Technologies
- JavaScript, including AJAX and ExtJS
- HTML and CSS
- R
- SQL
- Perl
- Java and JSP
- Subversion
A track record of working hard, taking ownership, and learning new technologies quickly is more important than any specific skill.
Benefits
401(k) matching contribution, bonus plan, medical/vision, life, AD&D, FSA, flexible work schedule.
Contact
Please email your resume to jobs@labkey.com.
-
Rich Internet Application Screen Design
[User Interface] (UX Magazine) A comprehensive guide to RIA screen design patterns.
Designing a rich Internet application (RIA) can test even an experienced design team. The hardest challenge is to blend Web and desktop paradigms to create a responsive and intuitive experience. Some paradigms that exist in the desktop environment are ill-suited for the Web, while many of the Web paradigms people are familiar with (paging, explicit refresh) are no longer necessary with RIA technologies like Flex and Ajax. As this space matures, we are learning more and more about which boundaries can be pushed, and which patterns transcend time and technology. While working on the book Designing Web Interfaces, Bill Scott and I explored hundreds of Web applications searching for these patterns. Armed with a crazy amount of examples, we distilled the patterns into six principles:
- Make It Direct
- Keep It Lightweight
- Stay in the Page
- Provide Invitations
- Use Transitions
- React Immediately
But we didn't tackle the larger topic of how to create a rich application. What is the process? How did products like Mint, Balsamiq, and Wufoo get so good?
This article will outline the process we use to create rich applications, focusing primarily on screen design. All of the content is geared specifically toward productivity applications like Software as a Service (SaaS) products and Rich Enterprise Applications (REAs).
Designing for Richness
Adding an accordion or coverflow to an application doesn't make it rich. It could even degrade users' experiences. Starting a design/redesign at the control or screen level can be an expensive and frustrating exercise. Really good RIAs embody richness on all four of these levels:
- Application Structure
- Screen Design
- UI Controls
- Interaction Design
Application Structure
Structuring the application properly will ensure a solid base for the rest of the design process. There are three types of application structure:
- Information: The right structure to use when people need to browse, compare, and comprehend information. For example, maps, news readers, dashboards, media players, online stores, etc.
- Process: The right structure to use when people need to provide information in a structured manner. For example, product configuration, setup, or installation; registration forms; tax preparation; checkout; booking travel.
- Creation: The right structure to use when people need to create new content or modify existing content. For example, blogging, illustrating, coding, photo editing, diagramming.
To pick the right one, we need to know the primary user's goal.
Application structure is based on the user's goal.
Unless we're designing a product the likes of which the market has never imagined, identifying the user's goal can be a quick and painless process. We typically work closely with the business/product owner to write a short story for each of the primary personas. At this stage it is essential to focus on the goal, not the tasks. We sketch up a storyboard and validate it with an informal user group.
Early storyboard to validate user goals.
Once we understand the user goals, we can pick the right application structure. Some applications will have multiple structures; maybe the user needs to configure a product initially (process structure), but afterwards the goal is to analyze information (information structure). This is completely normal and can be designed for.
Next we build on the personas and storyboards to diagram the screen map and process flow diagrams.
Screen map using the information application structure and hub-and-spoke principle.
Once these are established, we can start designing the screens.
Screen Design
We approach screen design with the "one screen per goal" philosophy. For instance, if the user's goal is to find a couple of houses and contact a realtor to set up showings, we design one screen to support this instead of creating one screen for every task in the flow. This takes some finesse and discipline, but it will eliminate unnecessary navigation from the most common workflows.
These 15 screen layouts illustrate the current best practices in RIA design. Be sure to check out the accompanying presentation on Slideshare (also embedded at the end of this article) with examples from 80 recent RIAs.
Master/Detail
The Master/Detail screen layout can be vertical, horizontal, or even nested. It is ideal for creating an efficient user experience by allowing the user to stay in the same screen while navigating between items. A horizontal layout is a good choice when the user needs to see more information in the master list than just a few identifiers, or when the master view is comprised of a set of items that each have additional details.
Column Browse
The Column Browse screen layout can be vertical or horizontal and a number of levels deep. It is ideal for creating a custom user experience by allowing the user to start from various entry points for navigating hierarchical or related data.
Palette/Canvas
Palette/Canvas is the perfect layout for the documentation or creation of linear or non-linear processes, flow diagrams, screen layouts, and designs/diagrams with physical size or layout constraints. The palette can be floating, dockable, or permanently situated. Consider offering a "fullscreen" toggle to give users maximum real estate for creating content.
Dashboard
A Dashboard layout will provide key information at a glance, real-time data, easy-to-read graphics, and clear entry points for exploration. Stephen Few's book, Information Dashboard Design: The Effective Visual Communication of Data, can be used for reference in designing and testing Dashboard designs.
Spreadsheet
The Spreadsheet layout can offer easy edits, additions, previews and totaling. This type of screen should provide the following functionality: standard table features like sort, hide/show columns, rearrange columns, group by (if applicable), global level undo/redo, add/insert/delete row, keyboard navigation, import and export, and possibly preview and summary functionality.
Interactive Model
The Interactive Model layout is characterized by many interactive elements associated with a core object (e.g., a graph, calendar, map, sheet music, or text). It closely aligns with the user's mental model and offers direct manipulation.
Search/Results
The Search screen pattern can range from very simple to quite advanced. This pattern is ideal for creating an efficient user experience by allowing the user to navigate directly to an item or set of items meeting specific criteria.
Refine Dataset
The Refine Dataset layout can be vertical or horizontal, and is ideal for creating an efficient user experience by allowing the user to refine a set of known data, or further refine search results.
Parallel Panels
Parallel Panels can be stacked (showing one at a time) or unstacked (showing all at once). This pattern is ideal for organizing chunks of information that are similar or have interdependent tendencies. Efficiency is gained by keeping the user in one screen. Ideal candidates for the stacked variation of this pattern are simple workflows with a visible goal that is fed by multiple inputs or multiple non-sequential steps.
Wizard
The Wizard layout is ideal for guiding a user through a complex or infrequent workflow. It can be vertical or horizontal depending on the nature of the data.
Question/Answer
The Q/A screen layout is ideal for helping a user quickly find a solution. Q/A differs from Search/Results in that it can assist users in identifying possible options or a single recommendation in an area where they lack expert knowledge (e.g., health insurance, mortgages, or budgeting).
Forms
Any Form layouts should be approached with a solid understanding of usability and design best practices. Web Form Design: Filling in the Blanks by Luke Wroblewski is a terrific resource for designing forms.
Portal
Portal layouts can provide a high degree of customization for users. They're ideal for news sites, but not a replacement for a well-designed Dashboard in a business application.
Tabbed
Tabs can be vertical or horizontal. The Tabbed layout should be explored after all other layouts have been considered. Before choosing the Tabbed layout, double check that this approach won't make users tab between sections to complete a single workflow. Remember to apply the "one screen per goal" philosophy. A Tabbed layout can work well when a workflow requires data to be analyzed from multiple perspectives, as with a list, chart, and heat map.
Browse
Browse can provide the best layout for users whose goal is to quickly scan and navigate information. It can be two or three columns, and typically the primary content is in the leftmost column, with additional related options served up in the right column(s).
Examples
Check out these designs in action; this presentation includes screenshots from 80+ current RIAs organized by screen layout.
UI Controls
Equally important as the layout is the selection of the right UI controls for the screen. Although we try to avoid designing for the chosen framework, we do invest a good chunk of time learning about it so we know what is technically feasible.
While some frameworks have a great set of controls out of the box (jQuery, Flex, ExtJS, Telerik RadControls), others offer only a subset of common controls and/or are designed with no regard for usability. Talk with the development team about what is feasible with the chosen framework, timeline, and budget. We use these 30 Essential Controls as our starting point for these discussions.
Once we get the screens 60-70% complete, we transition to prototyping, and those early discussions with the development team prove invaluable. In his book, Prototyping: A Practitioner's Guide, Todd Warfel recommends having the design about 70% complete before beginning prototyping. He suggests getting just enough designed to get buy-in from the user group and the stakeholders. Here are the types of questions we get when we pitch these designs:
- What happens when I click here?
- Will this move when I...?
- Oh, cool, can I...?
Instead of designing more wireframes to answer these questions, we build an interactive prototype. The best practices and examples in Designing Web Interfaces guide the interaction design down a proven path. The resulting prototype can be played with, tested, refined, and then delivered as part of the development specs. This has proved to be quite a bit quicker, cheaper, and more effective for getting a good RIA developed than our old approach, which resulted in a massive stack of wireframes and interaction specs.
Learn More
If you have found this article valuable, be sure to check out:
- Designing for Interesting Moments, a SlideShare presentation by Bill Scott.
- Designing Rich Applications, a SlideShare presentation by me, Theresa Neil.
- DesignGalleRIA, a design gallery and showcase of the best rich Internet applications.
- Web Form Design: Filling in the Blanks, by Luke Wroblewski. Rosenfeld Media, May 2008.
- Prototyping: A Practitioner's Guide, by Todd Warfel. Rosenfeld Media, May 2008.
- Designing for Flex, an Adobe Developer Connection article by Rob Adams.
- The Designer's Guide to Web Applications, Part 1: Structures and Flows, a User Interface Engineering paper by Hagan Rivers.
- Designing Web Interfaces: Principles and Patterns for Rich Interactions, by Bill Scott and Theresa Neil. O'Reilly Media, January 2009.
- About Face 3: The Essentials of Interaction Design, by Alan Cooper, Robert Reimann, and David Cronin. Wiley, May 2007.
- Designing Interfaces: Patterns for Effective Interaction Design, by Jenifer Tidwell. O'Reilly Media, November 2005.
-
Part nine: Climate scientists withheld Yamal data despite warnings from senior colleagues
[Guardian] (Science news, comment and analysis | guardian.co.uk)
Ancient trees dragged from frozen Siberian bogs do not undermine climate science, despite what the sceptics say
In a unique experiment, The Guardian has published online the full manuscript of its major investigation into the climate science emails stolen from the University of East Anglia, which revealed apparent attempts to cover up flawed data; moves to prevent access to climate data; and to keep research from climate sceptics out of the scientific literature.
As well as including new information about the emails, we will allow web users to annotate the manuscript to help us in our aim of creating the definitive account of the controversy. This is an attempt at a collaborative route to getting at the truth.
We hope to approach that complete account by harnessing the expertise of people with a special knowledge of, or information about, the emails. We would like the protagonists on all sides of the debate to be involved, as well as people with expertise about the events and the science being described or more generally about the ethics of science. The only conditions are that the comments abide by our community guidelines and add to the total knowledge or understanding of the events.
The annotations - and the real name of the commenter - will be added to the manuscript, initially in private. The most insightful comments will then be added to a public version of the manuscript. We hope the process will be a form of peer review. If you have a contribution to make, please email climate.emails@guardian.co.uk.
The anonymous commenting facility under each article will also be switched on so that anyone can contribute to the debate.
It is hard to believe that tree trunks dragged from frozen bogs in Siberia could undermine the argument about man-made climate change. But that is the claim that has been made by sceptics in recent months.
The claim is wide of the mark, but in the 1,073 emails stolen from the University of East Anglia last November the row over what the trees tell us about climate change is played out in detail. The scientists are shown clinging to their data to prevent it getting into the hands of sceptics even as at least one colleague advised openness to avoid the charge that "bogus science" was being hidden.
The width of annual growth rings in trees is a sensitive measure of temperature. And the secrets of those Siberian trees, some of them thousands of years old, have assumed an important place in the reconstruction of past temperatures for the whole planet.
Steve McIntyre, a Canadian former minerals prospector and climate sceptic who has analysed the data, suggests that one tree, known as YAD06, could be "the most influential tree in the world".
In the hacked emails from the Climatic Research Unit at UEA, one word looms large: Yamal. The first and last emails and more than a hundred in between include it. When I phoned Prof Phil Jones, the director of CRU, on the day the emails were published online, he said: "It's about Yamal, I think."
On 6 March 1996, a Russian scientist, Stepan Shiyatov, contacted Dr Keith Briffa, CRU's top tree-ring researcher. Shiyatov wanted money to take a helicopter to measure tree rings in timber hauled from the permafrost of the Yamal peninsula on the Arctic ocean's shores.
Briffa was keen, and he published papers on what those tree rings showed. But by late last year, in the final emails, he is mired in allegations of fraud, and the Yamal data had become a virus infecting past climate reconstructions.
The Yamal data turned up in many studies of global temperature that were cited by the UN's top climate science body, the Intergovernmental Panel on Climate Change, in a report published in 2007, where the relevant section was authored by Briffa. It supported the conclusion that temperatures followed a "hockey stick" shape, with stable temperatures for a thousand years, then sharp 20th-century warming.
By then, McIntyre was on the trail. He claimed that Briffa had not used all the tree ring data available, only a subset. Briffa said there were technical reasons for that. But McIntyre complained Briffa hadn't spelled out those reasons clearly.
In 2008, when Briffa published some data after a long delay, McIntyre charged that Briffa's analysis of the most recent warming was based on just 12 trees: the "Yamal-12". McIntyre said this was too small a sample to draw any conclusions, and claimed if the analysis was redone with other tree ring data from the region, the hockey stick shape disappeared.
It looked like a stalemate. But last year the bloggers moved in. Ross Kaminsky, a columnist on American Spectator, claimed: "One implication, supported by Briffa's near decade-long refusal to share his data, is that he cherry-picked the dataset that supported the conclusion he wanted to find."
Worse was the charge that other scientists had used the suspect Yamal data in their reconstructions of past climate. Ross McKitrick, a climate sceptic and environmental economist at Canada's University of Guelph, wrote that they are "the key ingredient in most of the studies that have been invoked to support the hockey stick". The Daily Telegraph blogger James Delingpole went even further in an article headlined: "How the global warming industry is based on one MASSIVE lie."
Briffa denies any wrongdoing. He said "we would never select or manipulate data in order to arrive at some preconceived or unrepresentative result". And there is nothing in the emails or anywhere else to suggest that isn't true. In September last year Briffa put out a statement on the CRU website defending his research. "We do not select tree-core samples based on comparison with climate data. Chronologies are constructed independently and are subsequently compared with climate data to measure the association and quantify the reliability of using the tree-ring data as a proxy for temperature variations."
One British colleague of Briffa wrote to me last month: "Why should Briffa – one of the world leaders in this field – have to explain himself to people … who are in fact amateurs?"
But others believe Briffa has a duty to explain himself. In October last year, Briffa's old boss at CRU, Tom Wigley, said in an email to Briffa's current boss, Phil Jones: "Keith does seem to have got himself into a mess." Wigley felt Briffa had not answered McIntyre's charges fully. "How does Keith explain the McIntyre plot that compares Yamal-12 with Yamal-all? And how does he explain the apparent 'selection' of the less well-replicated chronology rather than the later (better replicated) chronology? …
"The trouble is that withholding data looks like hiding something, and hiding something means (in some eyes) that it is bogus science that is being hidden."
The Yamal data has become important for scientists trying to analyse past climates. But it is not true that the Yamal rings are omnipresent in climate reconstructions. They were not in the data that produced the "hockey stick" graphs. According to Jones, of the 12 reconstructions of temperatures over the past 1,000 years used in the last IPCC assessment, only three included Yamal data. Other reconstructions were based on retreating glaciers, or water temperatures in boreholes, or cores sunk into ice sheets – but they too reproduce a hockey stick shape.
Even McIntyre denounces the more vocal sceptics with their conspiracy theories. In an apparent response to a challenge from the climate scientists' website RealClimate, he wrote to the American Spectator last October: "While there is much to criticise in the handling of this [Yamal] data, the results do not in any way show that AGW [anthropogenic global warming] is a 'fraud', nor that this particular study was a 'fraud'. There are many serious scientists who are honestly concerned about AGW and your commentary … is unfair to them." Sadly, when checked last week, there was no sign of this comment on the magazine website, though the magazine had found room for another feature on "The great hoax" of climate change.
guardian.co.uk © Guardian News & Media Limited 2010
-
Sex Determining Region Y-Box 2 (SOX2) Is a Potential Cell-Lineage Gene Highly Expressed in the Pathogenesis of Squamous Cell Carcinomas of the Lung
[Science] (PLoS ONE Alerts: New Articles)
Background
Non-small cell lung cancer (NSCLC) represents the majority (85%) of lung cancers and is comprised mainly of adenocarcinomas and squamous cell carcinomas (SCCs). The sequential pathogenesis of lung adenocarcinomas and SCCs occurs through dissimilar phases as the former tumors typically arise in the lung periphery whereas the latter normally arise near the central airway.
Methodology/Principal Findings
We assessed the expression of SOX2, an embryonic stem cell transcriptional factor that also plays important roles in the proliferation of basal tracheal cells and whose expression is restricted to the main and central airways and bronchioles of the developing and adult mouse lung, in NSCLC by various methodologies. Here, we found that SOX2 mRNA levels, from various published datasets, were significantly elevated in lung SCCs compared to adenocarcinomas (all p<0.001). Moreover, a previously characterized OCT4/SOX2/NANOG signature effectively separated lung SCCs from adenocarcinomas in two independent publicly available datasets, which correlated with increased SOX2 mRNA in SCCs. Immunohistochemical analysis of various histological lung tissue specimens demonstrated marked nuclear SOX2 protein expression in all normal bronchial epithelia, alveolar bronchiolization structures and premalignant lesions in SCC development (hyperplasia, dysplasia and carcinoma in situ) and absence of expression in all normal alveoli and atypical adenomatous hyperplasias. Moreover, SOX2 protein expression was significantly higher in lung SCCs compared to adenocarcinomas following analyses in two independent large TMA sets (TMA set I, n = 287; TMA set II, n = 511; both p<0.001). Furthermore, amplification of SOX2 DNA was detected in 20% of lung SCCs tested (n = 40) and in none of the adenocarcinomas (n = 17).
Conclusions/Significance
Our findings highlight a cell-lineage gene expression pattern for the stem cell transcriptional factor SOX2 in the pathogenesis of lung SCCs and suggest a differential activation of stem cell-related pathways between squamous cell carcinomas and adenocarcinomas of the lung.
-
Network-Based Elucidation of Human Disease Similarities Reveals Common Functional Modules Enriched for Pluripotent Drug Targets
[Science] (PLoS Computational Biology: New Articles)
Author Summary
Many human diseases are related to each other through shared causes or even shared pathology. Knowledge of these relationships has long been exploited to treat similar diseases with the same therapies. However, most of the traditional approaches to discover these relationships have depended on subjective measures, such as similarity in symptoms, or incomplete knowledge, such as genes with mutations. Here we present the first approach integrating high-throughput datasets such as mRNA expression and large-scale protein-protein interaction networks to discover human disease relationships in a systematic and quantitative way. We discover 138 significant pathological similarities among 54 human diseases, including lung cancer, schizophrenia, and malaria. We also discovered a set of common pathways and processes within the cell that are dysregulated in at least half of the diseases. We infer that these processes correspond to a common response of the human system to a disease state. Interestingly, we find that many of the proteins in these pathways are already known to be targets of existing drugs. In fact, the drugs corresponding to these proteins are known to treat significantly more diseases than expected by chance, highlighting the importance of these common molecular pathological pathways as prime therapeutic opportunities.
-
GPS Data Collection and GIS Mapping Specialist (Puyallup)
[Jobs] (craigslist | all jobs in seattle-tacoma)
We are looking for a part-time Mapping Specialist that will work independently with minimal supervision to research, plan, collect, interpret and print scaled maps for a variety of environmental projects, such as Sustainable Neighborhoods, Wetland and Stream Boundaries, Habitat and Species studies, Natural Resources, Mitigation Plans, Environmental Site Assessments, Soil, Groundwater and Soil Vapor Investigations, and Remediation.
This position will prepare, layer, and scale maps and figures such as: Vicinity, Parcel, Plot, Aerial Photograph, Sample Location, Topographic, National Wetlands Inventory, Soil Survey, Flood Plain, Local Wetlands Inventory, Wetland and Stream Survey, Water Type, Priority Habitat and Species, and Lidar.
PROFESSIONAL REQUIREMENTS
Strong GIS skills with the ESRI suite of software (ArcInfo, ArcGIS 9.3, ArcView, ArcMap), AutoCAD, Google Earth, Trimble and Delorme devices
Proficient with Microsoft software (Word, Excel, Access, Power Point)
Use GPS to collect environmental data and navigate in the field
Good information technology (IT) technical skills
A basic understanding of data management and relational databases
Maintains and updates the existing GIS library and data sets
Strong understanding of raster and vector data collection, import, and conversion
Experience in working with a variety of coordinate systems and performing transformations from variable datasets
Skilled in geo-referencing images into various GIS platforms
Advanced computer cartography and scale conversions
Exceptional technical editing and writing ability
Work with and import shapefiles, professional land surveys and .pdf conversions
Proven ability to organize large amounts of data and present in easily understandable format
Strong mathematical, statistical and analytical proficiency
Ability to create, edit, display and analyze environmental data
Analyze data to determine validity, quality, and scientific significance, and to interpret correlations between human activities and environmental effects
Prepare charts or graphs from data samples and provide summary information on the environmental relevance of the data
SOCIAL AND OTHER REQUIREMENTS
Comfortable with a fast-paced office and rapidly changing environment
Self motivated and independent worker and able to finish a job with minimal oversight within given timeframe and budget
Positive and able to communicate and work effectively with staff and clients
Must be a people person and able to work as a team
Works with a can-do attitude
Ability to make informed decisions quickly and efficiently
Communicate scientific and technical information through oral briefings, written documents, workshops, conferences, and public hearings.
Able to walk, stand, bend and lift 30+ pounds, work in all weather conditions and difficult terrains
Capable of working both inside (at the office) and outside (on the job site)
Valid driver's license and dependable vehicle for travel to job sites
EXPERIENCE
One year of work experience in a professional company or public agency preparing scaled GIS maps.
EDUCATION
Bachelor's or higher-level college degree in GIS, Geography, Computer Science, or an Environmental or Engineering discipline.
WORK SCHEDULE
Part-time, Monday through Friday.
RESUME SUBMITTAL INFORMATION
Closing date:
2.16.10
Mail resume to:
EnCo Environmental Corporation
P.O. Box 1212
Puyallup WA 98371
E-mail resume to:
gkemp@encoec.com
Fax resume to:
253.841.0264
-
Climate scientists withheld Yamal data despite warnings from senior colleagues | Fred Pearce
[Guardian] (Science news, comment and analysis | guardian.co.uk)
Ancient trees dragged from frozen Siberian bogs do not undermine climate science, despite what the sceptics say
Leaked climate change emails scientist 'hid' data flaws
Climate change emails reveal flaws in peer review
Controversy behind climate science's 'hockey stick' graph
It seems hard to believe that a handful of tree trunks dragged from frozen bogs in Siberia could undermine the argument about man-made climate change. But that is the claim that has been made by sceptics in recent months.
The claim is wide of the mark, but in the 1,073 emails stolen from the University of East Anglia last November the row over the trees and what they tell us about climate change is played out in detail. The scientists are shown clinging to their data to prevent it getting into the hands of sceptics even as at least one senior colleague advised greater openness to avoid the charge that "bogus science" was being "hidden".
Measuring the width of annual growth rings in trees is a sensitive measure of temperatures. And the secrets of those Siberian trees, some of them thousands of years old, have assumed an important place in the reconstruction of past temperatures for the whole planet. Steve McIntyre, a Canadian and former minerals prospector and climate sceptic who has analysed the data, suggests that one tree alone, known as YAD06, could be "the most influential tree in the world".
In the hacked emails from the Climatic Research Unit at the University of East Anglia, one word looms large: Yamal. The first and last emails and more than a hundred in between include it. When I phoned Prof Phil Jones, the director of CRU, on the day the emails were published online and asked him what he thought was behind it he said: "It's about Yamal, I think."
On 6 March 1996, a Russian tree ring researcher called Stepan Shiyatov contacted Dr Keith Briffa, CRU's top tree ring researcher. He was asking for money to take a helicopter to measure tree rings in timber hauled from the permafrost of the Yamal peninsula on the shores of the Arctic ocean.
Briffa was keen, and he published a series of papers on what those tree rings showed. But by late last year, in the final emails, he was mired in allegations of fraud, and the Yamal data had become a virus infecting reconstructions of past climate.
The Yamal data turned up in many studies of global temperature that were cited in the 2007 report of the UN's top climate science body, the Intergovernmental Panel on Climate Change, where the relevant section was authored by Briffa. It supported the conclusion that temperatures over the last thousand years followed a "hockey stick" shape, with stable temperatures over a thousand years followed by sharp 20th century warming.
By then, McIntyre was on the trail, however. He claimed that Briffa had not used all the tree rings data available, only a subset. Briffa said there were technical reasons for that. But McIntyre complained Briffa hadn't spelled out those reasons clearly.
And in 2008, when Briffa published some data after a long delay, McIntyre charged that Briffa's analysis of the most recent warming was based on just 12 trees: the "Yamal-12". McIntyre said this was far too small a sample to draw any conclusions, and he claimed that if the analysis were redone with other tree ring data from the region, the hockey stick shape disappeared.
It looked like a scientific stalemate. But last year political bloggers moved in. Ross Kaminsky, a columnist at the American Spectator magazine, claimed: "One implication, supported by Briffa's near-decade long refusal to share his data, is that he cherry-picked the dataset that supported the conclusion he wanted to find."
Worse was the charge that other scientists had incorporated the suspect Yamal data into their reconstructions of past climate. Ross McKitrick, a climate sceptic and environmental economist at the University of Guelph, wrote that they are "the key ingredient in most of the studies that have been invoked to support the hockey stick". Daily Telegraph blogger James Delingpole went even further in an article headlined: "How the global warming industry is based on one MASSIVE lie."
Briffa denies any wrongdoing. He said last autumn that "we would never select or manipulate data in order to arrive at some preconceived or regionally unrepresentative result". And there is nothing in the emails or anywhere else to suggest that isn't true. In September last year Briffa put out a statement on the CRU website defending his research. "We do not select tree-core samples based on comparison with climate data. Chronologies are constructed independently and are subsequently compared with climate data to measure the association and quantify the reliability of using the tree-ring data as a proxy for temperature variations." One British colleague of Briffa wrote to me last month: "Why should Briffa – one of the world leaders in this field – have to bother explaining himself to people who are not even specialists in this area – who are in fact amateurs?"
But others believe Briffa does have a duty to explain himself. In October last year, Briffa's old boss at CRU, Tom Wigley, said in an email to Briffa's current boss Phil Jones: "Keith does seem to have got himself into a mess." Wigley felt Briffa had not answered McIntyre's charges fully. "How does Keith explain the McIntyre plot that compares Yamal-12 with Yamal-all? And how does he explain the apparent 'selection' of the less well-replicated chronology rather than the later (better replicated) chronology?... The trouble is that withholding data looks like hiding something, and hiding something means (in some eyes) that it is bogus science that is being hidden."
The Yamal data has become important for scientists trying to analyse past climates. But it is not true that the Yamal rings are an omnipresent virus in reconstructions of past temperature. They were not in the original data that produced the "hockey stick" graphs. According to Jones, of the 12 reconstructions of temperatures over the past 1,000 years used in the last IPCC assessment, only three included Yamal data. And other reconstructions of temperature based on retreating glaciers, or water temperatures in boreholes, or cores sunk into ice sheets, self-evidently do not contain Yamal tree rings. But they too reproduce a hockey stick shape.
Even McIntyre denounces the more vocal sceptics with their conspiracy theories. In an apparent response to a challenge from the climate scientists' website RealClimate, he wrote to the American Spectator last October, saying that: "While there is much to criticise in the handling of this [Yamal] data, the results do not in any way show that AGW [anthropogenic global warming] is a 'fraud', nor that this particular study was a 'fraud'. There are many serious scientists who are honestly concerned about AGW and your commentary... is unfair to them." Sadly, when checked last week, there was no sign of this comment on the magazine website, though the magazine had found room for another feature on "The great hoax" of climate change.
-
Complete-Proteome Mapping of Human Influenza A Adaptive Mutations: Implications for Human Transmissibility of Zoonotic Strains
[Science] (PLoS ONE Alerts: New Articles)
Background
There is widespread concern that H5N1 avian influenza A viruses will emerge as a pandemic threat, if they become capable of human-to-human (H2H) transmission. Avian strains lack this capability, which suggests that it requires important adaptive mutations. We performed a large-scale comparative analysis of proteins from avian and human strains, to produce a catalogue of mutations associated with H2H transmissibility, and to detect their presence in avian isolates.
Methodology/Principal Findings
We constructed a dataset of influenza A protein sequences from 92,343 public database records. Human and avian sequence subsets were compared, using a method based on mutual information, to identify characteristic sites where human isolates present conserved mutations. The resulting catalogue comprises 68 characteristic sites in eight internal proteins. Subtype variability prevented the identification of adaptive mutations in the hemagglutinin and neuraminidase proteins. The high number of sites in the ribonucleoprotein complex suggests interdependence between mutations in multiple proteins. Characteristic sites are often clustered within known functional regions, suggesting their functional roles in cellular processes. By isolating and concatenating characteristic site residues, we defined adaptation signatures, which summarize the adaptive potential of specific isolates. Most adaptive mutations emerged within three decades after the 1918 pandemic, and have remained remarkably stable thereafter. Two lineages with stable internal protein constellations have circulated among humans without reassorting. On the contrary, H5N1 avian and swine viruses reassort frequently, causing both gains and losses of adaptive mutations.
Conclusions
Human host adaptation appears to be complex and systemic, involving nearly all influenza proteins. Adaptation signatures suggest that the ability of H5N1 strains to infect humans is related to the presence of an unusually high number of adaptive mutations. However, these mutations appear unstable, suggesting low pandemic potential of H5N1 in its current form. In addition, adaptation signatures indicate that pandemic H1N1/09 strain possesses multiple human-transmissibility mutations, though not an unusually high number with respect to swine strains that infected humans in the past. Adaptation signatures provide a novel tool for identifying zoonotic strains with the potential to infect humans.
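As a rough illustration of how a mutual-information comparison between human and avian sequence subsets might flag a characteristic site, the sketch below scores one alignment column against the host label. The abstract does not give the authors' exact procedure, so the function name, the toy residues and the host labels are all hypothetical, and this is only the textbook definition of mutual information rather than the paper's pipeline.

```python
import math
from collections import Counter


def mutual_information(residues, hosts):
    """Mutual information (in bits) between the residue at one alignment
    column and the host label of each sequence."""
    n = len(residues)
    joint = Counter(zip(residues, hosts))
    residue_counts = Counter(residues)
    host_counts = Counter(hosts)
    mi = 0.0
    for (residue, host), count in joint.items():
        p_joint = count / n
        p_residue = residue_counts[residue] / n
        p_host = host_counts[host] / n
        mi += p_joint * math.log2(p_joint / (p_residue * p_host))
    return mi


# Hypothetical toy data: eight sequences, four per host.
HOSTS = ["human"] * 4 + ["avian"] * 4
SEPARATING_SITE = list("KKKKEEEE")   # residue fully determined by host
CONSERVED_SITE = list("AAAAAAAA")    # same residue in every sequence

if __name__ == "__main__":
    print(mutual_information(SEPARATING_SITE, HOSTS))
    print(mutual_information(CONSERVED_SITE, HOSTS))
```

A column where the residue tracks the host scores high, while a column that is identical in every sequence scores zero, which is the intuition behind ranking sites by this quantity.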
-
There Are Two Types of Players...
[Baseball] (Baseball Analysts)
In this article, I'll attempt to finish the title's sentence by doing a principal component analysis on player statistics. Going into this I had no idea what I would find or whether the principal component analysis would find anything interesting at all.
For those unfamiliar with this type of analysis, the point of it is to reduce a large number of potentially correlated variables down to a few key underlying factors that explain the variables. The researcher feeds the computer a bunch of records (in this case, players) and several key variables (in this case, their statistics). The computer, blind to what those variables actually mean, spits out a set of underlying factors which explain the "true" underlying causes for the variables in question. It does this by maximizing the variability between the players. It's then up to the researcher to interpret what each factor represents. In this case, I'm looking for the one underlying factor that best describes a player.
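For readers who want to see the mechanics, here is a minimal sketch of extracting the first principal component. It is not the software used for this article: it standardizes each statistic, builds the correlation matrix, and finds the leading eigenvector by power iteration. The player rows are made-up numbers, and only four statistics are used to keep the example short.

```python
import math
import random


def standardize(rows):
    """Center each column on 0 and scale it to unit standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    sds = [math.sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
           for col, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def correlation_matrix(z_rows):
    """Correlation matrix of already standardized rows."""
    n, k = len(z_rows), len(z_rows[0])
    return [[sum(r[i] * r[j] for r in z_rows) / (n - 1) for j in range(k)]
            for i in range(k)]


def first_component(matrix, iterations=500, seed=0):
    """Leading eigenvector of a symmetric matrix via power iteration."""
    rng = random.Random(seed)
    k = len(matrix)
    v = [rng.random() + 0.1 for _ in range(k)]  # arbitrary nonzero start
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


STATS = ["1B", "HR", "BB", "SO"]
# Hypothetical season lines, one row per player, columns as in STATS.
TOY_PLAYERS = [
    [150, 5, 30, 50],
    [140, 8, 35, 60],
    [130, 12, 45, 80],
    [100, 25, 70, 120],
    [90, 35, 90, 150],
    [85, 40, 100, 170],
]

loadings = first_component(correlation_matrix(standardize(TOY_PLAYERS)))
# The sign of an eigenvector is arbitrary; orient it so HR loads positively.
if loadings[STATS.index("HR")] < 0:
    loadings = [-x for x in loadings]

if __name__ == "__main__":
    for stat, loading in zip(STATS, loadings):
        print(f"{stat:>2} {loading:+.3f}")
```

In this toy data singles fall as homers, walks and strikeouts rise, so the single factor that separates the players most is the one that sets singles against the other three.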
In the baseball world, I wondered what one underlying factor best determined a player's statistics. Normally, this type of analysis would be done on many more variables, but I wanted to see what it would pick out from players' basic, non-team influenced statistics: 1B, 2B, 3B, HR, BB, K.
The principal component analysis spits out a bunch of factors, each with decreasing importance in determining a player's statistics. Only the first one really had much meaning to it, and with only six variables to analyze, this wasn't much of a surprise. The analysis attempts to differentiate players as much as possible, but the big question was how did it divide the players? It could have pitted good players vs. bad players, power hitters vs. contact hitters, patient players vs. free swingers, etc. But what happened?
In fact the factor loadings for the first principal component were as follows:
1B -.556
2B .132
3B -.259
HR .502
BB .382
SO .456
As it turns out, the analysis shows that if you want to put the players into two distinct camps, one camp (whose overall scores will be positive) is made up of guys who hit with power, walk a lot, and strike out a lot, while another camp (whose scores will be negative) is made up of guys who hit a lot of singles and triples and make contact.
I actually think this makes a lot of sense in describing a player's hitting style in just one number. While of course there are plenty of metrics out there to determine a player's skill and value to a team, there isn't a single metric that describes a player's playing style on a sliding scale. A Batting Style score using these values as weights does just that.
On one end of the spectrum are contact hitters, small-ball, Mike Scioscia/Ozzie Guillen type players who make their living with singles, triples, and not striking out much. The other end are Earl Weaver/Billy Beane type players who hit homers and draw walks. Which type of player a man is best determines his statistics. It's Moneyball vs. small-ball. This one number represents the spectrum of playing styles.
To get a Batting Style score for each player, we can simply multiply their normalized statistics by the weights above. Doing so gives a normally distributed set of players with a range going from about -4 to 4. To make the results a little more intuitive, I converted this to a scale where the average was 100 with a standard deviation of 15. Players with high scores are "three true outcome" type players while those with low scores play with the opposite style.
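The calculation just described can be sketched in a few lines. The weights are the loadings listed above; everything else is an assumption for illustration: the four players and their stat lines are invented, "normalized" is taken to mean a z-score across the player pool, and the rescaling to a mean of 100 and standard deviation of 15 is done by z-scoring the raw scores.

```python
from statistics import mean, pstdev

# First-component loadings from the article.
WEIGHTS = {"1B": -0.556, "2B": 0.132, "3B": -0.259,
           "HR": 0.502, "BB": 0.382, "SO": 0.456}

# Hypothetical stat lines, invented for this example.
TOY_PLAYERS = {
    "power":   {"1B": 70,  "2B": 25, "3B": 1, "HR": 38, "BB": 95, "SO": 170},
    "contact": {"1B": 160, "2B": 22, "3B": 8, "HR": 4,  "BB": 30, "SO": 55},
    "average": {"1B": 105, "2B": 28, "3B": 3, "HR": 15, "BB": 55, "SO": 100},
    "mixed":   {"1B": 110, "2B": 35, "3B": 2, "HR": 30, "BB": 35, "SO": 75},
}


def zscores(values):
    """Standardize a list to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def batting_style(players):
    """Weighted sum of normalized stats, rescaled to mean 100 and SD 15."""
    names = list(players)
    z_by_stat = {stat: zscores([players[name][stat] for name in names])
                 for stat in WEIGHTS}
    raw = [sum(WEIGHTS[stat] * z_by_stat[stat][i] for stat in WEIGHTS)
           for i in range(len(names))]
    scaled = [100 + 15 * z for z in zscores(raw)]
    return dict(zip(names, scaled))


scores = batting_style(TOY_PLAYERS)

if __name__ == "__main__":
    for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{name:<8} {score:6.1f}")
```

With a real player pool the same function would take one stat line per player; the rescaling step only changes the units, not the ordering.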
How does the Batting Style number look according to 2009 data? The top ten most extreme players of each batting style are shown below:
Now, it's hard to imagine two more different sets of players. Everything that the first group of players does well, the second group does poorly, and vice-versa. Both sets have some good players and some bad players, and whether a player is good or bad doesn't much affect his Style score. Adam Dunn and Jason Bay provided good hitting value to their clubs, as did Jacoby Ellsbury and Ichiro; they just did it in different ways. A stat like wOBA tells you the value of a particular player. For instance, in 2009 Russell Branyan had a wOBA of .368 and Ichiro had a wOBA of .369. So they seem like pretty much the same player, right? Of course not. Ichiro and Branyan have two completely opposite styles of play. Ichiro has speed, gets a ton of singles and rarely homers, walks, or strikes out. Meanwhile Branyan's entire value is based on the long ball and the base on balls. The Batting Style score shows the immense difference between the two players. Branyan has the fifth highest Batting Style score, while Ichiro has the second lowest score.
Of course, not every player falls into one of these two types. Players who have a "medium" style can have moderate scores on each metric. For example, Ronnie Belliard does everything about average, hence his Batting Style score is about average. The middle of the scale also includes unusual players who don't fall into the usual patterns. Aaron Hill doesn't walk much or strike out much, but he hits home runs. Hence, his overall style falls in the middle. Meanwhile Bobby Abreu walks a lot, but also gets a lot of singles. Hence, he doesn't fall into either extreme. The Batting Style doesn't discriminate based on the skill of the player, although as you might expect, guys who have the power/walk Batting Style are as a whole slightly more valuable, simply because guys who hit a lot of home runs and take a lot of walks are generally more valuable than singles hitters, though the difference is not major. Guys on the contact end of the spectrum have a wOBA about 10 points lower than guys on the power end of the spectrum. You can check out the full list of player Batting Style scores here:
It's also interesting to look at this same list through history. Which players had the most extreme styles during each decade? The list below (including all players with at least 1000 career PA's) shows the top three extreme players in each decade.
As you might expect, Babe Ruth is the original power/walk/strikeout player. As someone who revolutionized the game in that regard, it comes as no surprise. Harmon Killebrew, Mark McGwire, and Dave Kingman are others that famously fall into that same mold and are identified here. Meanwhile, Willie Wilson, Nellie Fox, and Matty Alou are on the other end of the spectrum - precisely the guys that you would expect. The analysis was run on the dataset as a whole (though to be strictly correct, it should be run on each individual year). Over time, the styles have definitely shifted away from the contact approach and towards the power/walk style. Overall, there's not really a surprise in the bunch except for the fact that I've never heard of some of the older, more obscure players. Personally, I find both styles of player fun to watch as their extreme styles seem to make them more colorful, though I think that the power guys have historically caught more grief from fans and have been underrated up until the recent sabermetric revolution.
Whether a statistic like Batting Style has any real value to it or not, I think it's fun. Obviously, a line of six statistics isn't too hard to digest, but I like the idea of a single number describing a player's hitting style. In any case, it was interesting that the principal component analysis picked up on the two distinct styles and drew the scale the way it did. I think if you asked fans to name two completely opposite hitters, you would get a lot of Juan Pierre/Adam Dunn responses, which shows that the principal component analysis picked out an intuitive result.
-
Statistical Data Analyst, UCSF (San Francisco)
[Jobs] (craigslist | all jobs in SF bay area)
Statistical Data Analyst
University of California, San Francisco (UCSF)
The UCSF Center for Cerebrovascular Research has an excellent opportunity for a statistical data analyst to participate in various projects related to brain vascular malformations, which are an important cause of hemorrhagic stroke in young adults. We have a highly interactive and interdisciplinary group of investigators, including physician-scientists, genetic epidemiologists, statisticians, geneticists, and molecular biologists collaborating on clinical, translational and basic science projects. The research environment is enhanced by detailed phenotypic and genetic data from several large cohorts of healthy controls and different vascular malformation subtypes, including brain arteriovenous malformations and intracranial aneurysms. Our main laboratories and facilities are located at the San Francisco General Hospital campus, with patient recruitment and some research activities located at the Parnassus campus. Please visit http://avm.ucsf.edu/ for further information.
Job Responsibilities:
Assist in design of queries to extract medical and scientific data from several clinical databases, and manipulate large data files, such as those generated by genomewide array data
Ensure quality and integrity of data and produce reports
Create algorithms for data inspection and data cleaning
Develop analysis datasets and, under the direction of senior investigators, conduct complex statistical analysis, including analyses of large genome-wide SNP and expression data
Assist investigators in analyzing datasets for grant proposals or manuscripts, including power calculations and writing up results
Review programs and analyses conducted by others in the group for papers or grant proposals
Required Qualifications:
Master's-level biostatistician or genetic epidemiologist with several years' experience working in a medical/research environment
Proficiency in STATA programming or other commonly used statistical software packages (R, SAS, etc.), and willingness to learn STATA
Fluency in computer programming (Perl, C/C++, Java, etc.), and familiarity using different operating platforms, such as Linux
Experience with survival analysis techniques
Experience with sample size and power calculations
Excellent verbal and written interpersonal and communication skills
Attention to detail and quality of work
Ability to work well both independently and in a team
To apply, please send your resume/CV to Dr. William L. Young, MD, Director, Center for Cerebrovascular Research at ccr@sfgh.ucsf.edu
UCSF is an affirmative action/equal opportunity employer.
-
Improving DSEE7 Import Rate Through ZFS Caching
[Corporate Blogs, Enterprise, RIA (Rich Internet Apps)] (Sun Bloggers)
As we all know, the process of importing data into the directory database is the first step in building a directory service. Importing is an equally important step in recovering from a directory disaster such as an inadvertent corruption of the database due to hardware failure or an application with a bug. In this scenario, a nightly database binary backup or an archived ldif could save the day for you. Furthermore, if your directory has a large number of entries (tens of millions) then the import process can be time consuming. Therefore, it is very important to fine tune the import process in order to reduce initialization and recovery time.
Most import tuning recommendations have focused on the write capabilities of the disk subsystem. Undeniably, it is the most important ingredient of the import process. However, as we all know, the input to the import process is an ldif file which is used to initialize and (re)build the directory database. As demonstrated by our recent performance testing effort, the location of the ldif file is also very important. I'll mainly concentrate on ZFS in this post as time and again it has proven to be the ideal filesystem for the Directory. Note that in some cases you can save hours of time by even the smallest gain in the import rate, especially if your ldif file has tens of millions of entries.
Generally speaking there are a few gotchas that need to be kept in mind for the import process. The first thing is to ensure that you have a separate partition for your database, logs and transaction logs (this is actually true for any filesystem). For ZFS this translates into separate pools. Similarly, it is recommended to place the ldif file on a pool that is not being used for any other purpose during importing. This maximizes the read I/O for that pool without having to share it with any other process. In ZFS, the Adaptive Replacement Cache (ARC) plays an important role in the import process, as seen in the table below. ZFS caches can be controlled via the primarycache and secondarycache properties that can be set via the zfs set command. This excellent blog explains these caches in detail. To understand and prove the effectiveness of these caches we ran a few import tests on a SunFire X4150 system with ldif files of 3 million and 10 million entries each. The ldif file was generated using the telco.template via make-ldif. Details about the hardware, OS and ZFS configuration and other useful commands are listed in the Appendix.
Dataset      primarycache (6GB)   secondarycache   Time taken (sec)   Import Rate (entries/sec)
3 Million    all                  all              887                3382.19
3 Million    metadata             metadata         1144               2622.38
3 Million    metadata             none             1140               2631.58
3 Million    none                 none             1877               1598.3
3 Million    all                  none             909                3300.33
10 Million   all                  all              3026               3304.69
10 Million   metadata             metadata         3724               2685.29
10 Million   metadata             none             3710               2695.42
10 Million   none                 none             7945               1258.65
10 Million   all                  none             3016               3315.65
The table shows the results of various combinations of primarycache and secondarycache on the ldifpool only. The db pool where the directory database is created always had primarycache and secondarycache set to all. The astute reader will notice from the Appendix that the ZFS Intent Log (ZIL) is actually configured on a flash memory. This did not, however, skew our results, as we are concerned with the ldifpool where the ldif file resides.
So, going back to the table: as expected, the primarycache (ARC in DRAM) is the key catalyst for read performance. Disabling it causes a catastrophic drop in the import rate (from about 3,382 to 1,598 entries/sec for the 3 Million file, and from about 3,305 to 1,259 entries/sec for the 10 Million file), primarily because prefetching also gets disabled and many more reads have to go to the disk directly. The charts below (data obtained via iostat -xc) depict this very clearly: the disks are a lot busier reading when the primarycache is set to none for the 3 Million ldif file import.
So far, I have concentrated on the primarycache (ARC). What about the secondarycache (L2ARC)? Typically the secondarycache is utilized optimally when used with a flash memory device. We did have a flash memory device (Sun Flash F20) added to the ldifpool; however, our reads were sequential, and by design the L2ARC does not cache sequential data. So for this particular use case the secondarycache did not come into play, as is evident from the results in the table. Maybe if we had limited the size of the ARC to just 1GB or less, the prefetches would have "spilled" over to the L2ARC, and the L2ARC would have contributed more.
Finally, a disclaimer: since the intent of this exercise is to show the effect of the ZFS caches, the import rates in the table are for comparison only and are not a benchmark. I would also like to thank the colleagues who helped me with this blog: Brad Diggs, Pedro Vazquez, Ludovic Poitou, Arnaud Lacour, Mark Craig, Fabio Pistolesi and Nick Wooler.
Appendix

zm1 # uname -a
SunOS zm1 5.10 Generic_141445-09 i86pc i386 i86pc

zm1 # cat /etc/release
Solaris 10 10/09 s10x_u8wos_08a X86
Copyright 2009 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 16 September 2009

zm1 # cat /etc/system | grep -i zfs
* Limit ZFS ARC to 6 GB
set zfs:zfs_arc_max = 0x180000000
set zfs:zfs_mdcomp_disable = 1
set zfs:zfs_nocacheflush = 1

zm1 # zfs set primarycache=all ldifpool
zm1 # zfs set secondarycache=all ldifpool

zm1 # echo "::memstat" | mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     189405               739    2%
ZFS File Data               52657               205    1%
Anon                       184176               719    2%
Exec and libs                4624                18    0%
Page cache                   7575                29    0%
Free (cachelist)             3068                11    0%
Free (freelist)           7944877             31034   95%
Total                     8386382             32759
Physical                  8177488             31943

NOTE: The system had three ZFS pools. The “db” pool stored the directory database and was striped across 6 SATA disks, with the ZIL on flash memory. The “ldifpool” pool was where the ldif file, transaction and access logs were located. In the import process the transaction and access logs are not used, so the pool was entirely dedicated to the ldif file.

zm1 # zfs get all ldifpool | grep cache
ldifpool  primarycache    none  local
ldifpool  secondarycache  none  local

zm1 # zpool list
NAME       SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
db         816G  2.25G   814G   0%  ONLINE  -
ldifpool   136G  93.0G  43.0G  68%  ONLINE  -
rpool      136G  75.6G  60.4G  55%  ONLINE  -

zm1 # zpool status -v
  pool: db
 state: ONLINE
 scrub: none requested
config:
        NAME        STATE   READ WRITE CKSUM
        db          ONLINE     0     0     0
          c0t1d0    ONLINE     0     0     0
          c0t2d0    ONLINE     0     0     0
          c0t3d0    ONLINE     0     0     0
          c0t4d0    ONLINE     0     0     0
          c0t5d0    ONLINE     0     0     0
          c0t6d0    ONLINE     0     0     0
        logs
          c2t0d0    ONLINE     0     0     0
errors: No known data errors

  pool: ldifpool
 state: ONLINE
 scrub: none requested
config:
        NAME        STATE   READ WRITE CKSUM
        ldifpool    ONLINE     0     0     0
          c0t7d0    ONLINE     0     0     0
        cache
          c2t3d0    ONLINE     0     0     0
errors: No known data errors

  pool: rpool
 state: ONLINE
 scrub: none requested
config:
        NAME        STATE   READ WRITE CKSUM
        rpool       ONLINE     0     0     0
          c0t0d0s0  ONLINE     0     0     0
errors: No known data errors

ds@dsee1$ du -h telco_*
48G telco_10M.ldif
14G telco_3M.ldif

ds@dsee1$ grep cache dse.ldif | grep size
nsslapd-dn-cachememsize: 104857600
nsslapd-dbcachesize: 104857600
nsslapd-import-cachesize: 2147483648
nsslapd-cachesize: -1
nsslapd-cachememsize: 1073741824
-
'What is the value of Linked Data to the news industry?'
[Guardian] (Blogposts | guardian.co.uk)
Last week I went to a "News Linked Data Summit", organised by The Guardian, the Media Standards Trust, and the BBC. As part of the day I gave a presentation entitled 'What is the value of linked data to the news industry?'. Here is a transcript of my talk.
Almost every talk about Linked Data I've seen inevitably at some point shows the 'linked data' universe bubble diagram. Every time I see it, it has grown in size. However, the first time I saw it, I noticed a glaring omission. None of our major UK print-based news organisations featured on it, and that fact is yet to change.
We now know that, whatever the outcome of the next election, we are only going to see more Government and state gathered data published, not less. So how, as the news industry, are we going to respond to this, and what does the digital news media look like in a world with a high level of semantic state data available?
To imagine how it could work, let us look at a non-news example from a news organisation. The BBC's Wildlife Finder has taken the huge amount of wildlife and natural world footage that the BBC possesses, and broken it down into short clips, tagged up with the animals and habitats that the clips feature. This has allowed the BBC to slice and dice that content, attaching relevant pieces of video to a huge website that has used a Linked Data approach.
There are a couple of points to note.
Pages are performing very well in SEO terms. They sometimes even outrank Wikipedia in Google when people make one word searches for animals, which is no mean feat. This is in part due to the dense inter-linking with highly relevant anchor text terms. And the ongoing maintenance cost of organising this wealth of content is reduced. Information architects, librarians and taxonomists may not want to hear it, but by relying on a vocabulary generated from distributed Linked Data sources, the BBC has been able to do this without the costly overhead of a large metadata team maintaining the index. Instead, relationships with the datasets of the WWF, University of Michigan and DBpedia do the work for them.
I think that one of the most important things to understand about a 'Linked Data' future for news is that this is about building a platform for a range of products and services. When we think of the 'open' web, it is easy for the net-heads and neophiles amongst us to assume that this has to mean free as in free beer, as well as free as in free speech.
It doesn't.
News organisations need digital tools in three spheres - in the commissioning and production of content, in the B2B sphere that so many of us are also active in, and in producing B2C 'news' for our audiences. 'Open' doesn't necessarily mean 'open' to all of the public, it could mean 'open' within the industry, or 'open' with specific partners.
So let us look at a theoretical example.
There are plenty of news events and reporting around schools in Britain, whether it is the data in the league tables, the local newspaper reporting a school play, or the national press showing an interest in a school that hits the headlines - usually for tragic reasons. At the moment there is no way of reliably linking up that coverage. Indeed, some early online incarnations of school league tables baked the data directly into HTML, so that sometimes an organisation can't even refer back precisely to their own previously published schools data.
Let us picture a scenario where each school has a unique canonical identifier, which is applied to all Government data relating to that school. Or - more likely perhaps - that we have mappings of all the different ways that one school might be uniquely identified, depending on the data source. Now picture that news organisations have also tagged any content about that school with the same unique or a similarly interoperable identifier.
Suddenly, when a newsworthy event takes place, a researcher within a news organisation has at their fingertips a wealth of data - was the school failing, had the people involved been in any coverage of the school before, does the school have a 'history' of related incidents that might build up to a story. We have here a potential application of linked civic and news data that improves the tools in our newsrooms.
And just because we share some common identifiers for data, it doesn't necessarily mean producing homogeneous content. It is perfectly possible to imagine one news group producing an application that works out the greenest place to live if you want your child to be in the catchment area of a particular school, and another newspaper to use different sets of data to produce an application to tell you where you need to buy a house if you want to get your child into school x, and have the least chance of being burgled. And then news organisations repackaging these services and syndicating them to estate agent and property websites as part of their B2B activities.
If this isn't about collaborating on content, it isn't about collaborating on a pure technology level either. It is about collaborating on some conventions of classification and naming that will help us all as the semantic data web matures. Not one ontology to rule them all, but a way of publishing Linked Data that meets certain standards, and making these interoperable.
With the news industry facing structural change and a global advertising downturn, there is naturally an emphasis on whether any new tools and techniques can "make more money". One way of making more money is in fact to "spend less money". There may well be an economy of scale in agreeing to some linked data principles.
Take the example of car manufacture. The Ford Focus C-Max, the Mazda 5 and the Volvo C70 are very different cars. Their brands appeal to different consumer segments, and they have different performance characteristics and price points. However, they are all built using the Ford C1 "Compact Class" manufacturing platform, a joint effort by 90 engineers drawn from Ford, Volvo and Mazda. Sharing manufacturing platforms amongst different brands and companies reduces the cost for them all - for example the R&D, and the parts and servicing needed for the assembly line can be aggregated. Now, those particular three automobile groups have some ownership structures in common, but the principle has been used in the car industry to save money since the 1970s.
And there is a precedent in our broadcast media in the UK. In the radio sphere, Nick Piggott, Head of Creative Technology at Global Radio, has the mantra "Agree on technology, compete on content". By that he means getting the platform structure right for protocols like RadioDNS or RDS, and then differentiating on the services delivered over those protocols.
As an industry we do it, to an extent, in print. We use a selection of standard paper sizes that allow us to use standard printing machines and standard point-of-sale displays. And we fill our papers and websites with standard advertising formats, to make it easier for advertisers to do business with us. We even do it, to an extent, with content, where we use agencies to provide copy and reporting where we don't have the scale to report ourselves. Linked Data may be another area where we can make a 'standard' that gives us that economy of scale.
There are obvious applications already that could be improved with better metadata. The Newspaper Licensing Agency eClips service scrapes content from our CMS systems. They could provide a better service, and a more valuable return on our investment, if that content also contained linked data identifiers, allowing them to develop better packages for their consumers.
We hear a lot about how this new device or that one is going to transform or 'save' the publishing industry. Actually, the thing that has perhaps most revolutionised the distribution of digital content over the last decade has not been a device, but the humble hyperlink. Not only have hyperlinks joined content together themselves, but it transpired that understanding the relationships between pages, as signified by those hyperlinks, was the key to making web search work. The power of the semantic hyperlink and URI promises an even greater impact.
In the early years of the web, many publishing companies simply re-published their existing content in a static format on the net, failing to take advantage of the interconnected and two-way nature of the medium. As the 'web of data' evolves, there is a risk that in the future news organisations will similarly look at the businesses and services that have emerged, and realise that they should have been involved in publishing semantic data from the outset.
The release of large amounts of government data is a significant step along the way to a semantic web. Embedded in the CSV files, data-dumps, and Excel spreadsheets there are plenty of stories waiting to be discovered. It is going to be hard for the general public to use and explore this data in the raw formats that it is released in. It is news organisations that have the story-telling expertise, and background material explaining the context and the consequences exposed by the data.
There is no doubt that people outside of mainstream news organisations will produce innovative products and services around the data that governments are releasing. Without familiar household brands behind them, these will take time to gain traction, scale and audience. Well, we have scale and audience in abundance. Implementing a Linked Data approach across our content should lead to better tools for journalists, better services to sell to our business partners, and, ultimately, better story-telling with which to reach and inform our audiences.
guardian.co.uk © Guardian News & Media Limited 2010 | Use of this content is subject to our Terms & Conditions
-
Liveblogging Prop 8 Trial Friday Morning (34)
[Right-Wing, Politics] (Politics4All Latest Blogs)
Last Plaintiffs’ Expert witness will testify today:
- Gregory M. Herek, Ph.D. a Professor of Psychology at the University of California at Davis. He will testify about the nature of sexual orientation, how mainstream mental health professionals and behavioral scientists regard homosexuality, benefits conferred by marriage, stereotypes relating to lesbians and gay men, stigma and prejudice directed at lesbians and gay men, the harm to lesbians and gay men and their families as a consequence of being denied the right to marry, and how the institution of domestic partnerships differs from that of marriage and is linked with antigay stigma.
Boutrous asks Walker his preference regarding closing arguments.
Walker: Sometime in the future, after I have an opportunity to review all the evidence. Then I can come back with my questions and hear your closing. Any objections to that?
None.
Boutrous: PX2542 & PX2543 videotapes during Segura’s testimony, please admit?
Walker: So ordered
Boutrous: Wrt redactions of memos, Pugno asked that we add back in a paragraph, document we discussed on Wednesday during Segura’s testimony. That may be all in terms of exhibits.
Walker: I suppose all sides will want to review everything to be sure they have everything submitted
Cooper: we have had a chance to review the Nathanson and Young videotapes, and have our counterdesignations about those.
Walker: Boies?
B: There will be some things we object to as outside the scope, and after lunch we will have our complete response.
Walker: I gather, Mr Cooper, that you will not be calling Nathanson and Young?
Cooper: No, we will not, and we offer these counterdesignations and objections to having Plaintiffs call these witnesses as their own.
Boutrous: Calling Mr Herek, and Mr Detmer will be conducting the examination.
(Clerk swears in Gregory M Herek)
D: Good morning, please describe your educational background
H: PhD 1984 in social psychology from UCLA. Social psych intersects sociology and psychology.
D: Dissertation on what?
H: The attitudes of heterosexuals towards G&L
D: Later work?
H: Yes, at Yale, I continued my studies and expanded them to include the stigma associated with HIV AIDS. Then taught at Yale for one year, and then at CUNY. Then I returned to CA and am a research social psychologist at UC. Focused entirely on research. 1999 I became a tenured full professor at UC Davis.
D: Teach what?
H: Societal stigma based on sexual orientation, and methodological techniques in grad and undergrad on surveys, and teach seminars on stigma and prejudice
D: Do you have binders?
H: No, I don’t see any
D: (gets binders for witness, clerk and judge)
D: Turn to PX2326, please. Dr Herek, what is that doc?
H: My CV.
D: Move into evidence, please
D: Prof Herek, editorial boards of peer review journals?
H: Listed in CV, Basic and Applied Social Psychology, Journal of Sex Research, several others.
D: Prof associations?
H: Member and fellow of APA, Society for Experimental Psychology
D: Authored things?
H: Published approximately 100 articles and chapters in edited volumes on sexual orientation, stigma, and prejudice.
D: How much grant money?
H: Excess of five million, most of it from Natl Institutes of Health.
D: Proffered as expert
Walker: Very well, you may proceed
D: What opinions
H: 1. Nature of sexual orientation and how understood in sociology and psychology; 2. immutability of sexual orientation 3. stigma and prejudice against G&L and how that intersects with Prop 8
D: May I approach?
Walker: You may
D: I’ve run these by D-I and they do not object to this list of documents.
Walker: D-I?
D-I: No problem
D: Are these the documents (except PX 2265 2563 2564 2565 2567) the ones you relied on?
H: Yes, also except 2530.
D: Turning to opinions, what is sexual orientation?
H: An enduring sexual romantic or emotional attraction to men, women, men and women or men or women. Also used to describe identity. And used to describe behavior.
D: How used in different contexts?
H: Depends on the nature of the study. In public health, the focus is on sexually transmitted disease, so they might focus on behavior. When studying discrimination, though, we might focus on identity, since that is how people are singled out for prejudice.
D: Do you ask ordinary people about their own sexual orientation?
H: We typically don’t use the term specifically. Instead, we ask if they are heterosexual, gay, straight - and they get this.
Walker: What do you mean by ordinary people - those who don’t study this professionally?
D: Yes, your honor.
H: This is about relationships and attachments.
D: Why are these issues important?
H: the need for attachment and intimacy are part of the core of what’s important to humans.
D: Is homosexuality an illness?
H: No
D: Inability to contribute to society?
H: no relationship to sexual orientation and ability to be a contributing member of society.
D: What about in the past?
H: In 1952, APA created the DSM. Homosexuality was included. Over time, that inclusion was disputed and there were many challenges to it. In 1973, the APA removed homosexuality from the DSM, supported by the (other) APA.
D: PX885, please. What is this?
H: Copy of that first DSM published in 1952
D: Turn to pages 38-39, under heading sexual deviation
H: Yes, under psychological disorders
D: Now look at PX764, describe
H: This is the policy statement by the American Psychological Assoc after the APA changed the DSM; affirmed that homosexuality limits no capabilities. APA urges all mental health professionals to take the lead to remove stigma from homosexuality. APA has reaffirmed that position.
Walker: What led to that change?
H: That’s a long story
Walker: Well we have some time
H: It’s important to look at how homosexuality got into the DSM in the first place. It was based on assumptions in the 1940s and 1950s, not empirical research. Later, actual research showed that homosexuals were not suffering from a disorder. Also, other institutions (including psych and psychiatry) were understanding homosexuality was not a disorder. This was based on actual research.
Walker: So at first it wasn’t based on research (1952) and then in 1973 it was based on empirical data.
H: Yes, but also the culture had changed. Empirical studies failed to support homosexuality as a mental association. But also, the culture had come to see it as a non-disorder.
Walker: Thank you
D: Do people choose their sexual orientation?
H: My research shows that people when asked say they have experienced no choice or very little choice
D: Are you familiar with reparative therapy or sexual change therapy?
H: Yes, they are types of therapies that try to change people’s sexual orientation
D: Are those therapies effective
H: Let me define effective first: that it achieves its goals and does not harm the person undergoing therapy. And by that definition, NO it does not show effectiveness.
D: Does the APA have a position?
H: Yes they’ve been around a long time, APA has studied them a lot. Task force was asked to evaluate the current status of these therapies, their effectiveness and safety. Produced a report in 2009: very thorough review of the studies available (there weren’t many worth reviewing) but those that do exist showed that these therapies are of very limited effectiveness and can do some harm.
D: PX888, what is?
H: Report of the task force, on appropriate therapeutic responses to sexual orientation.
D: Turn to page in Exhibit 888?
H: APA concludes that there is insufficient evidence that these therapies are effective
D: Are there specific concerns about these therapies when used around adolescents?
H: Adolescents are just developing their sexuality, and vulnerable in that they are not in complete control of their lives. And, the APA was concerned that the adolescents might not be able to provide true and informed consent; that they were coerced. Also, there is an underlying sense that these therapies view homosexuality as something that is wrong, that needs fixing, and this is especially harmful to adolescents
D: Turn to PX2338. What is that?
H: Pamphlet: "Just the facts about homosexuality for youth" for principals, teachers, counselors.
D: Lists the orgs that endorse that pamphlet, teachers, counselors, health, interfaith alliance, school psych, social workers, national education associations.
H: Yes, they are all listed
D: Turn to page 5 please. Lists the conclusions, please read.
H: "Despite general consensus that homo and hetero are ordinary and normal, there are political and religious organizations that try to change, teach children that it is bad, and this is very dangerous."
D: Can gay men and lesbians marry in California?
H: Well they can marry a person of the opposite sex
D: Is that realistic?
H: No, because of what sexual orientation means in terms of intimacy and attraction
D: Have G&L people married others of the other sex?
H: Yes
D: Why?
H: They might not have known yet, they might have known and wanted a cure, but this doesn’t work
D: Is this a problem?
H: Not every one of them dissolves, but they experience considerable problems. Especially if the spouse did not know going into the marriage, it creates conflict for the couple and their children and other members of their extended family and their friends.
D: Same sex couples in CA can be domestic partners?
H: Yes
D: With almost all the rights and privileges of marriage?
H: Yes
D: So it’s just a word difference?
H: Well no it’s a lot more than a word. People in the US are willing to give G&L people all the benefits and rights and responsibilities of marriage under the word domestic partners. But they won’t give marriage to G&L. So clearly they see something is different about getting married. Just the fact that we are here today shows there is a great deal of societal conflict over whether G&L should be able to marry.
D: Does marriage contribute to long term stability of relationships?
H: Yes, for many reasons, some positive, also barriers to leave. Not an easy thing to dissolve a marriage: economic, social, expectations, community standards. We know that relationships are more likely to be enduring and stable if they are based on REWARDS. But the barriers might also keep people in a rough patch together, and maintain the marriage over a difficult time.
D: Do DPs create those same barriers?
H: We lack a lot of data, but I would say that DPs don’t have the same barriers to dissolution that marriages do. In 2004, the CA legislature increased the benefits and responsibilities of DP. In 2004, the SecState mailed a letter to all registered DPs: do you want to go forward or dissolve your DP based on these changes? I find it difficult to imagine that if the marriage tax laws changed, the government would write to all married couples to advise them they might want to consider divorce. It just would not happen.
D: Here is the letter, were DPs dissolved?
H: Yes, researchers at UCLA tracked dissolutions, and there was an increase in 2004, with a huge spike in dissolutions in December, perhaps in response to the letter.
D: Empirical data?
H: Yes, UCLA used actual data to track that
D: PX909, is this that study?
H: Yes
D: Figure 9, shows huge spike in 12.04 dissolutions of DPs. Right before the new law went into effect.
D: Are you aware of any studies of the effect of getting married on same sex couples?
H: Mass Public Health Dept asked a number of questions of Mass married couples (same-sex). They concluded that most couples (>70%) said their commitment to their relationship had improved.
D: Familiar with term ‘stigma’?
H: Very familiar, it’s about groups viewed negatively, such that members of those groups are devalued, looked down on, leading to the group members having less access to the levers of power: econ, social, community
D: What is ‘structural stigma’?
H: Well, stigma can be expressed by individuals thru violence or prejudice, but society can express stigma as well. Through the law.
D: Are G&L stigmatized today
H: Yes, a great deal of research shows G&Ls face stigma. Lots of people say they have negative feelings, or even feel disgusted by G&L. FBI tracks hate crimes for sexual orientation. National study I conducted, found 1 in 5 G&L had experienced violence. Lower percentage had experienced discrimination in employment. We see prejudice in schools against children thought to be G&L. Think about it: many places two men cannot walk down the street holding hands.
D: How does structural stigma support that?
H: Structural stigma gives permission for individuals to express their prejudice.
D: Does this extend to relationships as well as individuals?
H: Oh yes, researchers use photographs to convey the idea of homosexuality. They get stronger negative reaction to same sex couples photos than opposite.
D: Now this study you did, please describe
H: We asked members of a community sample (2200 people) to what extent they felt they had a choice about being G&L&bisexual. Frequency of responses: these were referred to as "essentialist beliefs." 87% of gay men said they had no choice or very little choice; lesbians 75% had no or little choice about their sexual orientation
D: PX930, is this your more recent study on this topic?
H: Yes, been accepted for publication but not yet published, on page 278 of the MS, you see the percentages for a similar question. 88% of gay men saying they had no choice; 7% had small amount of choice. For lesbians, 68% said no choice; 15% had little choice.
D: Is this studied for heterosexual people?
H: No, but most hetero men and women would probably say they did not make a choice to be heterosexual. No data, but it is my strong hypothesis.
D: Please take a look at the testimony of Helen Zia: A ‘brief feeling of what equality is, tasted the water that was sweeter from the fountain that was formerly for heterosexuals only.’ Is this about stigma?
H: Yes, this shows how a person who felt stigmatized, and then briefly in 2004 she felt that difference had been removed.
NO MORE QUESTIONS
Walker: Very well, Mr Neilsen?
Neilsen: More binders for you!
(Missed the beginning of this, long time to save this post)
Neilsen is asking questions about attraction vs identity, does everyone identify as G&L who acts on those feelings
N: Can gay be used to refer to both sexes?
H: Some women prefer not to be called gay, but in studies and research we call both G&L gay.
N: Sometimes an individual’s social identity is very much tied to being gay?
H: Some individuals have a strong sense, and others do not.
N: Usually it is a continuum, from exclusively homosexual to exclusively heterosexual.
H: That goes back to Kinsey, it is generally assumed that continuum exists, but now we refer to Homo, Bi, and Heterosexual
N: But you believe there is a continuum?
H: That can be a useful way to look at sexual attraction, yes
N: Your report, please. You list the three categories: Homo, Bi, Hetero. You offered that as your expert opinion, yes?
H: Yes, I did
N: Sexual orientation is relational, yes?
H: Yes
N: Not readily apparent by looking at a person?
H: Yes, unless wearing an item of apparel or an identifying button
N: PX2018, what is?
H: Yes, I wrote this.
N: You wrote: Homosexuality encompasses a variety of phenomena. Although mainly focused on behavior, also refers to relationship, bonding, and community. Yes?
H: That’s what I wrote.
N: Did you consider these definitions of homosexuality in your opinions?
H: Yes
N: You write here that homosexuality has at least five different components: desire, behavior, identities, relationship and families, and then communities? Were these your definition of homosexuality in this case?
H: Yes I did
N: Bisexuality = homo + hetero, yes?
H: Usually reserved for people who exhibit both strongly
N: But these three labels are an oversimplification?
H: Sometimes they can be
N: (reads) "Homosexuality is usually understood as a counterpoint to heterosexuality, with bi including both. But this is an oversimplification." You wrote this?
H: Yes, but then I went on to show how people aren’t always consistent in their overlap of differentiation of homosexuality.
N: Please go to Corsini Encyclopedia of Science of Sexuality, PX??? You wrote this?
H: Yes
N: You wrote that homosexuality is about behavior, community, and toerh things.
N: Not all people with homo attractions identify at gay?
Many men regularly have sex with other men but don’t ID themselves as gay, correct?
H: This has been observed, yes
N: PX926, entry for a paper you coauthored. "Sexual ORientation and MEntal Health" IF not in evidnce, offered. Page 355, under the heasding Historical Background, "Historically, identification of homosexuality is a modern construct, although behaviors have been around forever. "
H: Emerged in the medical discourse in the 19th century.N: In most empirical research, more ID has been by behavior, Identification, or attraction? People might be IDed in one study as homosexual might not be in another, right?
H: The vast majority of people are consistent, but there is a small groupo of people for whom this might not be true.N: You write that only half of those who don’t ID as gay sometimes act of gay impulse.
H: Some people regard their sexuality in personal terms and do not outwardly or socially identify as G&L.
N: So half in this study ID as heterosexual, half ID as homosexual, even though they all were included based on their behavior. Did you use these statistics in your report?
H: Yes I have been aware of them.
N: Now the world where sexual minority youth become aware is vastly different than it was in previous generations?
H: I would say everything about the world is vastly different than previous generations.
N: So people’s identification has changed?
H: (discusses the reclaiming of the word QUEER by younger LGBT people; they may not use the words gay or lesbian or even homosexual)
N: Now let’s look at your chapter "Why Tell if You’re Not Asked" about military self-identification. Page 201, heading about sexual orientation: "Although homo and hetero behaviors alike have been common throughout human history, ways in which cultures have made sense of them have varied widely." Is that true?
H: Yes, like race, religion, other identifiers.
N: So in the US is it true that homosexuality has changed its identification?
N: This classification focuses on the individual rather than the behavior?
H: Yes
N: Instead of (continues reading really fast from Herek’s chapter).
H: There has been an expansion of recognition of bisexuality in recent decades.
N: Does the trichotomy create three ideal types?
H: I am using the phrase "type" and "ideal type" in the sense of category.
N: Depending on the individual it might not align with behavior?
H: What I mean by "ideal" isn’t preferred, it’s "distinct."
N: Identification provides entry into alternative community nowadays?
H: Yes
N: I have this book I want you to look at.
H: Yes I want to see the book
N: Permission to approach (copies? Walker asks)
N: No I don’t have extra copies, the discussion we’re having is about the part reproduced in the binders
N: Can you ID this document?
H: Never seen it
N: Edited by Lee Badgett?
H: That’s what it says, but I’ve never seen this book
N: Go to this chapter about discrimination about literature and economics. Are you familiar with Prof Badgett?
H: I am familiar with her but have never seen this.
N: Offered into evidence?
WALKER: Why don’t you ask a question?
N: I have identified the books and included the chapters.
Walker: Fair enough, but let’s try a question to the witness, please?
N: Page 21, Badgett writes: The first complication in defining sexual orientation, sexuality encompasses several different dimensions of attraction, identity and behavior. You agree with Badgett?
H: Yes, that’s what I have said
N: That there are several different distinct dimensions?
H: Well yes I have used those
N: Reads more Badgett: ID LGBT people based on the frequency of same sex partners; same number of same sex contacts as opposite sex contacts.
H: Here Badgett is trying to extrapolate from a survey in which people didn’t identify themselves. There were some individuals who were not consistent; she is trying to explain that in the absence of a better measure available to her, she chose to use behavior as her marker.
N: Do you agree?
H: I haven’t read it.
N: But is this a reasonable approach, IDing people as LGB if they’ve had as many same sex sex partners since 18 as opposite sex partners?
H: Well, what we try to do when we do social science is try to explain what we do. She was having to count as LGB people who had at least as much sexual contact with same sex as opposite. That might not be an ideal approach, there have been other operational definitions, but since we know how she used her data we can understand her work better.
N: You know Prof Badgett?
H: I don’t know how she is regarded among economists, I am not one, but she is well regarded where our fields overlap.
N: Admit whole book
OBJECTION: Foundation for whole book?
WALKER: We will admit the excerpt in Tab 10, with respect to the whole chapter or the whole book.
N: Modify my request to just the chapter
Walker: No objection to admitting just the chapter.
N: Opposing counsel wants to look at book?
Plaintiffs: Well we want the excerpt not the whole chapter
Walker: That’s what I heard, we’ll admit the excerpt and reserve on the whole chapter subject to counsel’s review of the whole chapter.
N: Turn to Tab 9, book DIX950, I have a physical copy of this book. Have you read it?
H: Parts of it.
N: This is by Prof Badgett
H: My binder doesn’t have the full….
N: It’s black, I can see the title and that’s all.
N: Your honor, may I approach? (shows book)
H: Reads title "Money, Myths, and Change: the economic life of G&L"
N: Turn to the first page of acknowledgements: Over the years I’ve received ideas and suggestions, including…. Greg Herek. Is that you?
H: I believe so.
N: "Defining the book’s boundary around G&L doesn’t address what G&L means? fantasize? Identify as G&L? Act on G&L attraction?"
H: She is describing the same three attributes I mentioned earlier.
N: Turn to page seven. "All of these historical analyses suggest that being G&L is shaped by broad social contexts that includes economic development" Agree?
H: Applies to EVERYONE in America.
N: Page 29: "sexual orientation definition issue has produced a huge heated theoretical debate with much discussion about who is G&L"
H: Not sure what she means by heated debate. There’s been lots of research and discussion. I would really have to read her book and I can’t comment on that sentence out of context.
N: I just received confirmation that we have an email from opposing counsel that they got both these Badgett books.
N: So is Prof Badgett mistaken?
H: No I said I would have to read it
N: Would it be reasonable?
H: I would have to read it
N: Would it be unreasonable?
H: Prof Badgett is a reasonable scholar. I would have to understand the context.
N: But would you agree that there’s been a heated theoretical debate?
H: I don’t know the debate she is referring to; there’s a lot of heated theoretical debates in the social sciences. If you’d tell me exactly what theoretical debate she’s referring to, I might be able to answer your question. But not out of context.
N: Alright, let’s go on. (reads section about frequency and recentness of same sex partners) Is it reasonable to say that ANYONE who’s had a same sex partner is homosexual?
H: I don’t know the context, but she seems to be talking about the strengths and weaknesses of operationalizing a variable.
N: Would that be a reasonable approach?
H: Well, for her study, it probably is.
N: (reads section on identity regarding same number of same sex and opposite sex partners) Do you agree with this definition of homosexuality?
H: Well, it would depend on the number of partners, I think. If only one of each, without knowing their current partner, or whether the other partner was in their distant past... but hers is a defensible strategy in trying to ID LGB in the data set.
N: But leaving aside the dataset, is it true generally?
H: But she is talking methodology about this particular dataset. If you could talk to each individual, you’d want to know more. This would be ideal, but likely not a capability Badgett had with this large dataset.
N: (Reads how people didn’t line up perfectly between their behavior and their self-identification)
H: I am familiar with this study, this way of breaking out the data is her own.
N: Does her breakout of the data about your definition of sexual orientation?
H: These data align very consistently with what we know: 90/10 split with some few people who weren’t identifying themselves as we might if we IDed them based on behavior.
N: "Sexual orientation is not like sex and race in being able to be identified" Agree?
H: Well, typically just looking at a person you can’t tell their sexual orientation, as you usually can with race and sex. Although you can’t always tell a person’s race or ethnicity by looking at them right away either.
OBJECTION TO ADMISSION OF DOCUMENT on foundation
Walker wants a break; Nielson says he has many more questions, since this theory of sexual orientation definition is crucial to their case.
Walker: Well you are welcome to ask questions, but I think we might like to take a break. Fifteen minutes.
Will pick up the liveblogging at the FDL News Desk here (35).

